Why is the cloud native data lake worth paying attention to?
2022-06-09 18:58:00 【51CTO】

In recent years, the "data lake" has been mentioned by more and more people. Although there is no uniform definition of a data lake, this has not stopped enterprises from putting it into practice one after another: Amazon, Alibaba, Tencent, Huawei, and others are all exploring how to build one.
Why have so many enterprises entered the data lake field before reaching any consensus on what a data lake is?
One possible reason is that the data lake represents a new system that integrates storage and computing for the new era of big data and AI, a trend that becomes even more obvious once the data lake is combined with the cloud.
To understand this, let's start with how the data lake developed.
PART 01
The rise of the data lake
In 2010, James Dixon, founder and chief technology officer of Pentaho, first proposed the concept of the data lake. He compared the data in a data lake to water in its natural state: unprocessed, with its original structure retained.
Data flows from its sources into the lake like water, and all kinds of users can come to the lake to fetch and distill it. Accordingly, the industry and users initially defined the data lake as a centralized system that stores data in its original format and can hold structured, semi-structured, unstructured, and binary data of any size.
As big data technologies converged and developed, the boundary of the data lake kept expanding and its meaning kept changing. It has gradually evolved into a comprehensive big data solution that combines unified storage of multi-source heterogeneous data, multi-paradigm computing and analysis, and unified management.
This makes the data lake very different from the data warehouse .
The data warehouse was born in the database era. Its core idea is to convert large amounts of data from databases into a fixed format and periodically copy it into another database with columnar storage, so as to meet enterprise query and data analysis needs.
In the past, most enterprise data came from ERP and CRM systems, and its scale was often at the TB level. Enterprises usually deployed data warehouse solutions on premises to store and analyze it. However, the data warehouse's model paradigm is fixed, and the underlying data cannot vary.
With the development of the Internet, data volumes have soared and a growing share of data is unstructured. Enterprise business changes faster and faster, digital transformation has become a hot topic in the IT industry, and data needs deeper value mining. It is therefore necessary to ensure that the original information in the data is not lost, so that changing future needs can be met.
The traditional data warehouse cannot meet enterprise demands in the big data era for real-time and interactive analysis. The data lake instead adopts a "loose in front, strict at the back" design: it drops the strict schema at ingestion time and applies the schema later, when the data is read. This gives it greater flexibility, while unified storage and computing optimizations ensure data consistency and performance. This is why the data lake has gradually attracted attention in the big data field.
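The "loose in front, strict at the back" idea can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the sample records, the field names, and the helper read_with_schema are hypothetical and do not come from any particular data lake product. Raw records are stored as they arrive, and a consumer applies its own schema only when reading.

```python
import json

# Raw records land in the lake exactly as produced; no schema is enforced on write.
# (Hypothetical sample data for illustration.)
RAW_EVENTS = [
    '{"user": "u1", "amount": "12.5", "ts": "2022-06-01"}',
    '{"user": "u2", "amount": 7, "channel": "app"}',
    '{"user": "u3"}',
]

# A schema chosen by one consumer at read time: field name -> (type, default).
ORDER_SCHEMA = {"user": (str, ""), "amount": (float, 0.0)}


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and project each record onto the given schema."""
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for field_name, (cast, default) in schema.items():
            # Missing fields fall back to the default; present fields are cast.
            row[field_name] = cast(record.get(field_name, default))
        rows.append(row)
    return rows


orders = read_with_schema(RAW_EVENTS, ORDER_SCHEMA)
total_amount = sum(row["amount"] for row in orders)
```

Another consumer could read the same raw lines with a different schema (for example, one that keeps "channel"), which is the flexibility the paragraph above describes. The cost is that malformed values surface at read time rather than at ingestion.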
Today, the data lake is no longer limited to a single technology or software product. It covers a diverse data architecture, including data storage, data lake computing, and data lake AI, to meet the production and management needs of enterprise users.
PART 02
Why the data lake and cloud native are the best combination
As databases and middleware represented by Oracle found it increasingly hard to meet the data processing requirements brought by rapidly changing enterprise business, the IT industry kept producing new computing engines. For example, enterprises began to build their own Hadoop data lake architectures on open source software: raw data is stored on HDFS, the engines come mainly from the open source Hadoop and Spark ecosystem, and storage and computing are integrated.
The disadvantage of this architecture is that the enterprise has to operate and manage the entire cluster itself, which means high cost and poor cluster stability.
Against this background, the cloud-hosted Hadoop data lake architecture (that is, the EMR open source data lake) emerged. The underlying physical servers and open source software versions are provided and managed by the cloud vendor, while the data is still stored on HDFS and the engines still come mainly from the open source Hadoop and Spark ecosystem.
This architecture uses the cloud's IaaS layer to improve the elasticity and stability of the machine layer, which lowers the enterprise's overall operation and maintenance cost. However, the enterprise still has to manage and govern the running state of HDFS and its services, that is, the operation and maintenance of the application layer.
Because storage and computing are coupled, stability is not optimal; and because the two resources cannot be scaled independently, the cost of use is not optimal either.
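The cost effect of coupling can be made concrete with a small arithmetic sketch in Python. All numbers below (workload size and per-node capacity) are hypothetical and chosen only to show the mechanism: when each node bundles a fixed amount of disk and CPU, the scarcer resource dictates the cluster size and the other resource is over-provisioned.

```python
import math


def coupled_nodes(storage_tb, cpu_cores, tb_per_node, cores_per_node):
    """Nodes needed when every node bundles fixed storage and compute."""
    by_storage = math.ceil(storage_tb / tb_per_node)
    by_compute = math.ceil(cpu_cores / cores_per_node)
    # The scarcer resource decides the cluster size.
    return max(by_storage, by_compute)


def idle_cores(storage_tb, cpu_cores, tb_per_node, cores_per_node):
    """CPU cores that are paid for but not needed in the coupled layout."""
    nodes = coupled_nodes(storage_tb, cpu_cores, tb_per_node, cores_per_node)
    return nodes * cores_per_node - cpu_cores


# Hypothetical workload: 1000 TB of data but only 320 cores of steady compute,
# on nodes that each provide 20 TB of disk and 32 cores.
nodes_needed = coupled_nodes(1000, 320, tb_per_node=20, cores_per_node=32)
wasted_cores = idle_cores(1000, 320, tb_per_node=20, cores_per_node=32)
```

In this made-up case storage forces 50 nodes even though compute alone would need only 10, leaving 1280 cores idle. With storage and compute scaled independently, the compute side could be sized for 320 cores and grown or shrunk without touching the stored data.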
At the same time, limited by the capabilities of open source software itself, traditional data lake technology cannot meet enterprise users' needs for upgrades in data scale, storage cost, query performance, and elastic computing architecture, and it falls short of the ideal goal of the data lake architecture.
Cloud computing, by contrast, lets the data lake deliver its greatest value. Cloud computing offers highly flexible, elastic, and scalable computing and storage resources, which make data storage, analysis, and application much easier. The greatest value of the data lake lies in bringing data of every format within the enterprise together, running multiple kinds of analysis on a single copy of the data, and mining data cost-effectively and efficiently. Because the design philosophy of the data lake fits cloud computing naturally, cloud service providers have been an important driver of its implementation ever since the concept was proposed in 2010.
With the arrival of the cloud native era, the data lake's performance advantages can be maximized when it is deployed in a cloud native way. When cloud native is mentioned, many people first think of concepts such as serverless and containerization. In recent years, however, the concept has gradually broadened to cover many products and services. To some extent, cloud native is a design paradigm for distributed systems, one that emphasizes flexibility, security, stability, and similar properties.
On the one hand, once the data lake moves to the cloud it benefits from the improvements the cloud itself brings: high availability (compared with a self-built IDC, the cloud environment has more redundant resources, so when a node fails the workload can switch seamlessly to other nodes and business continuity is preserved), elasticity (cloud computing scales dynamically and affordably, which can address throughput and I/O bottlenecks and match the huge, bursty resource demands of big data analysis), and agility (the cloud frees enterprises from repetitive, complex low-level IT work, while its modular, loosely coupled architecture supports rapid iteration, deployment, operation and maintenance, and innovation of data products).
On the other hand, the data lake can do more performance optimization in a cloud native environment, such as analysis acceleration brought by rich context, real-time release of data value brought by the convergence of stream processing and batch processing, and security and quality improvements brought by one-stop data management.
This allows enterprises to make effective use of public cloud infrastructure, and it gives the data lake platform more technical options. For example, fully managed cloud storage systems are gradually replacing HDFS as the storage infrastructure of the data lake, and the range of available engines keeps expanding.
In short, the cloud's characteristic pooling, elasticity, and agility allow many ideas at the data layer and application layer to be realized, and embracing cloud native has become an inevitable choice for the data lake and for big data in general.
PART 03
Looking ahead to the future of the cloud native data lake
To summarize, the cloud native data lake is a new kind of technology product that big data computing platforms have developed by drawing on cloud computing ideas. It supports flexible storage of heterogeneous data and elastic scaling of computing resources, and it helps enterprises cope with a business environment of increasingly complex data structures and increasingly demanding data processing timeliness.
In other words, the cloud native data lake is an architectural principle with many possible implementations: one can be built on EMR, and one can also be built on Flink.
It should be noted, however, that although China's data lake technology is developing and breaking through year by year, and public cloud vendors and other vendors are all making attempts, barriers and difficulties remain in data sensing and collection and in classification and cleaning, and experience in data lake modeling is lacking. Overall, China's data lake market is in its infancy, the technical route is not unified, and product capabilities across the industry are uneven.
Judging from current adoption, putting the data lake into practice in China still has many pain points.
At the product level, the data governance capability and end-to-end capability of the data lake still need to be strengthened.
In terms of data governance, data must be registered in a catalog by category and according to rules. If an enterprise has insufficient control over its data lake, the catalog and overall architecture will be poorly designed, and the data in the lake will not be fully archived or maintained, so a data swamp forms easily. Without associated contextual metadata, data in a swamp cannot be retrieved, and users cannot analyze or use it effectively.
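As a rough illustration of why catalog rules matter, the Python sketch below models a toy metadata catalog. It is a simplified assumption, not the design of any real catalog service: the class names DatasetEntry and Catalog, the required fields, and the sample paths are all hypothetical. The point is that a dataset which enters the lake without owner and category metadata cannot be found later, so registration rejects it.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Metadata recorded for one dataset in the lake (hypothetical fields)."""
    path: str
    owner: str
    category: str
    tags: set = field(default_factory=set)


class Catalog:
    """Toy catalog: every dataset must carry an owner and a category."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        # Enforce the catalog rule at ingestion time instead of after the fact.
        if not entry.owner or not entry.category:
            raise ValueError(f"{entry.path}: owner and category are required")
        self._entries[entry.path] = entry

    def search(self, category=None, tag=None):
        """Return the paths of datasets matching the given category and/or tag."""
        matches = []
        for entry in self._entries.values():
            if category is not None and entry.category != category:
                continue
            if tag is not None and tag not in entry.tags:
                continue
            matches.append(entry.path)
        return sorted(matches)


catalog = Catalog()
catalog.register(DatasetEntry("/lake/raw/orders", "sales-team", "transactions", {"daily"}))
catalog.register(DatasetEntry("/lake/raw/clicks", "web-team", "behavior", {"daily", "pii"}))

daily_paths = catalog.search(tag="daily")
transaction_paths = catalog.search(category="transactions")
```

A real data lake would keep far richer metadata (schema, lineage, retention), but even this minimal rule shows the difference between a searchable lake and a swamp of unlabeled files.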
In terms of end-to-end capability, few domestic providers can currently offer a full-pipeline cloud native data lake service. Most vendors only support individual data lake components, so downstream enterprises have to buy from multiple suppliers to cover their needs from data collection and management through analysis and visualization.
At the application level, industry awareness and talent development for the cloud native data lake are relatively weak. On talent: the big data and AI technology stacks are changing rapidly, and enterprises lack professionals. Internally, managers know little about data governance; if an enterprise blindly builds a data lake in pursuit of a large, all-inclusive design without thoroughly reviewing its current business situation and needs, the data lake may deliver poor results. On industry awareness: although the value of data is widely recognized, most enterprises still choose to wait and see, and the data lake still faces many challenges in awareness and promotion.
In addition, as enterprise digital transformation enters deep water, data has become a core production factor, and one of the biggest risks of the data lake is security and access control. Large amounts of data flow into the lake without any supervision. If some of that data carries privacy and regulatory requirements that other data does not, leakage or loss can occur, with consequences that are hard to estimate.
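One common way to reduce this risk is to tag sensitive columns and check a reader's clearance before returning values. The Python sketch below is a minimal illustration under assumed names: the column tags, the roles, and the function read_row are hypothetical and are not taken from any specific product.

```python
# Hypothetical sensitivity tags for columns; untagged columns are unrestricted.
SENSITIVE_COLUMNS = {"phone": "pii", "id_number": "pii"}

# Hypothetical roles and the sensitivity tags each role is cleared to read.
ROLE_CLEARANCE = {"analyst": set(), "compliance": {"pii"}}

MASK = "***"


def read_row(row, role):
    """Return a copy of the row with columns the role may not see masked."""
    allowed_tags = ROLE_CLEARANCE.get(role, set())  # unknown roles get no clearance
    visible = {}
    for column, value in row.items():
        tag = SENSITIVE_COLUMNS.get(column)
        if tag is None or tag in allowed_tags:
            visible[column] = value
        else:
            visible[column] = MASK
    return visible


record = {"user": "u1", "phone": "13800000000", "city": "Hangzhou"}
analyst_view = read_row(record, "analyst")
compliance_view = read_row(record, "compliance")
```

The sketch defaults to denying access: a role that is not listed receives no clearance, so newly ingested sensitive columns stay masked until someone explicitly grants access.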
Of course, any industry in its early stage has such problems, and these imperfections simply mean there is room to grow. According to iResearch's analysis, driven by favorable national policy (for example, the state has successively issued documents such as the "Action Plan for Big Data Development" and the "Implementation Plan for Computing Hubs of the National Integrated Big Data Center Collaborative Innovation System" to help the big data industry mature), the rapid development of Internet technology, the acceleration of enterprise digital transformation, and other factors, China's cloud native data lake market is expected to grow at a compound rate of 39.7%.
The future of the cloud native data lake therefore deserves our expectation and attention.
Reference:
https://www.iresearch.com.cn/Detail/report?id=3972&isfree=0