002: What are the characteristics of a data lake?

2022-06-12 09:36:00 【YoungerChina】

I think the characteristics of a data lake can be analyzed further at two levels: data and computation.

1. Characteristics of data

"Fidelity": The data lake stores an exact, complete copy of the data held in the business systems. Unlike a data warehouse, a data lake must keep a copy of the original data, and neither the data format, the data schema, nor the data content should be modified. In this respect, the data lake emphasizes preserving business data in its original form. The data lake should also be able to store data of any type and format.
"Flexibility": One point in the table above is "schema-on-write" vs. "schema-on-read", which is essentially a question of the stage at which the data schema is designed. Any data application needs schema design at some point; even for "schemaless" databases such as MongoDB, best practice still recommends that records share the same or a similar structure wherever possible. The logic behind schema-on-write is that, before data is written, the schema must be determined from the way the business accesses the data, and the data is then imported according to that schema. The benefit is a good fit between data and business, but it also means a higher upfront cost of ownership. When the business model is unclear and the business is still in an exploratory stage, a data warehouse is not flexible enough. The data lake emphasizes schema-on-read, and the logic behind it is that business uncertainty is the norm: since we cannot anticipate changes in the business, we stay flexible and defer schema design, so that the whole infrastructure can fit data to the business on demand. I therefore think that fidelity and flexibility go hand in hand: since business changes cannot be predicted, simply keep the data in its most original state and process it as required when the need arises. The data lake is thus better suited to innovative enterprises and to enterprises whose business changes rapidly. At the same time, a data lake places higher demands on its users; data scientists and business analysts (equipped with suitable visualization tools) are its target users.
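The difference between the two approaches can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the sample events, the `ORDER_SCHEMA` mapping, and the helper names `write_with_schema` and `read_with_schema` are hypothetical and do not come from any particular data lake product.

```python
import json

# Raw events kept verbatim, as a data lake would store them.
# Note that the two records do not share the same fields.
RAW_EVENTS = [
    '{"user": "alice", "amount": "12.50", "channel": "web"}',
    '{"user": "bob", "amount": "7", "coupon": "SPRING"}',
]

# Hypothetical schema fixed up front (schema-on-write).
ORDER_SCHEMA = {"user": str, "amount": float}


def write_with_schema(raw_line, schema):
    """Schema-on-write: coerce at load time; fields outside the schema are lost."""
    record = json.loads(raw_line)
    return {name: cast(record[name]) for name, cast in schema.items()}


def read_with_schema(raw_lines, schema):
    """Schema-on-read: the raw lines stay untouched; a schema is applied per query."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            name: (cast(record[name]) if name in record else None)
            for name, cast in schema.items()
        }


# Warehouse-style load: only the fields designed in advance survive.
warehouse_rows = [write_with_schema(line, ORDER_SCHEMA) for line in RAW_EVENTS]

# A later, unplanned question about coupons can still be answered from the raw copy.
coupon_view = list(read_with_schema(RAW_EVENTS, {"user": str, "coupon": str}))
```

In this sketch the `coupon` field is dropped by the schema-on-write load, while the schema-on-read path can still recover it because the raw records were never modified.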
"Manageability": The data lake should provide comprehensive data management capabilities. Since the data must be both faithful and flexible, at least two kinds of data exist in the data lake: raw data and processed data. Data in the lake keeps accumulating and evolving, so the demands on data management are high. At a minimum, it should cover data sources, data connections, data formats, and data schemas (database/table/column/row). In addition, the data lake is the unified data store within a single enterprise or organization, so it also needs some permission management capability.
"Traceability": The data lake stores all of the data of an organization or enterprise, so it needs to manage the full data life cycle, including definition, ingestion, storage, processing, analysis, and application. A strong data lake implementation must make the ingestion, storage, processing, and consumption of any piece of data traceable, so that the complete process by which the data was produced and flowed can be clearly reconstructed.
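As a rough illustration of manageability and traceability together, the sketch below keeps a tiny in-memory catalog that records each dataset's source, format, and parent datasets, and then reconstructs how a dataset was produced. The `Dataset` and `Catalog` classes and the dataset names are hypothetical; a real data lake would rely on a dedicated metadata or lineage service rather than this toy structure.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    source: str            # where the data came from
    data_format: str       # e.g. "json" or "parquet"
    derived_from: list = field(default_factory=list)  # names of parent datasets
    operation: str = "ingest"                         # step that produced it


class Catalog:
    """Toy metadata catalog: registers datasets and answers lineage queries."""

    def __init__(self):
        self._datasets = {}

    def register(self, dataset):
        # Refuse datasets whose parents are unknown, so lineage stays complete.
        for parent in dataset.derived_from:
            if parent not in self._datasets:
                raise KeyError(f"unknown parent dataset: {parent}")
        self._datasets[dataset.name] = dataset

    def lineage(self, name):
        """Return (dataset, operation) steps from the raw origins up to `name`."""
        steps, seen = [], set()

        def visit(current):
            if current in seen:
                return
            seen.add(current)
            dataset = self._datasets[current]
            for parent in dataset.derived_from:
                visit(parent)
            steps.append((dataset.name, dataset.operation))

        visit(name)
        return steps


catalog = Catalog()
catalog.register(Dataset("orders_raw", "orders_db", "json"))
catalog.register(
    Dataset("orders_clean", "lake", "parquet", ["orders_raw"], "deduplicate")
)
catalog.register(
    Dataset("daily_revenue", "lake", "parquet", ["orders_clean"], "aggregate")
)
revenue_lineage = catalog.lineage("daily_revenue")
```

Here `revenue_lineage` lists the raw dataset first and the final aggregate last, which is the "complete generation and flow process" the paragraph above asks for, in miniature.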

2. Computational features

Personally, I think the data lake places very broad demands on computing capability, and these depend entirely on the computing requirements of the business.

Rich computing engines. From batch processing and stream computing to interactive analysis and machine learning, all kinds of computing engines fall within the scope of the data lake. In general, data loading, transformation, and processing use a batch engine; the parts that need real-time computation can use a streaming engine; and some exploratory analysis scenarios may call for an interactive analysis engine. As big data technology and artificial intelligence technology converge, machine learning and deep learning algorithms are also being brought in; for example, frameworks such as TensorFlow and PyTorch already support reading sample data from HDFS/S3/OSS for training. For a qualified data lake project, therefore, the extensibility and pluggability of computing engines should be a basic capability.
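One way to picture "pluggable" computing engines is a registry that maps a workload type to whichever engine has been plugged in for it. The sketch below is only an illustration under that assumption: `EngineRegistry`, `batch_total`, and `streaming_running_total` are hypothetical names, and the toy functions stand in for real batch or streaming engines.

```python
class EngineRegistry:
    """Maps a workload type (e.g. "batch") to a pluggable engine callable."""

    def __init__(self):
        self._engines = {}

    def register(self, workload, engine):
        # Registering again under the same workload swaps the engine.
        self._engines[workload] = engine

    def run(self, workload, records):
        if workload not in self._engines:
            raise ValueError(f"no engine registered for workload: {workload}")
        return self._engines[workload](records)


def batch_total(records):
    """Toy batch engine: processes the whole dataset in one pass."""
    return [{"total": sum(r["amount"] for r in records)}]


def streaming_running_total(records):
    """Toy streaming engine: emits an updated result for every incoming record."""
    total, results = 0, []
    for record in records:
        total += record["amount"]
        results.append({"running_total": total})
    return results


registry = EngineRegistry()
registry.register("batch", batch_total)
registry.register("streaming", streaming_running_total)

events = [{"amount": 1}, {"amount": 2}, {"amount": 3}]
batch_result = registry.run("batch", events)
stream_result = registry.run("streaming", events)
```

The point of the design is that the same stored data can be handed to different engines, and a new engine (for example an interactive or machine learning one) can be added with another `register` call without touching the data itself.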
Multimodal storage engines. In theory, the data lake itself should have built-in multimodal storage engines to meet the data access needs of different applications (taking into account response time, concurrency, access frequency, cost, and other factors). In practice, however, data in the lake is usually not accessed frequently, and most of the related applications are exploratory. To achieve an acceptable cost-performance ratio, data lakes are therefore usually built on relatively inexpensive storage engines (such as S3/OSS/HDFS/OBS) and work with external storage engines when needed to meet diverse application requirements.

Copyright notice
This article was written by [YoungerChina]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/163/202206120921251020.html
