002: What are the characteristics of a data lake?
2022-06-12 09:36:00 【YoungerChina】
I think the characteristics of a data lake can be analyzed at two levels: data and computation.
1. Characteristics of data
"Fidelity": The data lake stores a complete, identical copy of the data held in the business systems. Unlike a data warehouse, a data lake must keep a copy of the original data: its format, schema, and content should not be modified. In this respect, the data lake emphasizes preserving business data in its original form. The data lake should also be able to store data of any type or format.
"Flexibility": One item in the table above is "schema-on-write" vs. "schema-on-read", which is essentially a question of at which stage the data schema is designed. Every data application needs schema design at some point. Even for "schemaless" databases such as MongoDB, best practice still recommends that records share the same or similar structures wherever possible. The logic behind schema-on-write is that, before any data is written, the schema is determined from the way the business will access the data, and the data is then imported according to that schema. The benefit is a good fit between data and business, but it also means a higher upfront cost of ownership. When the business model is unclear and the business is still exploratory, a data warehouse is not flexible enough. The data lake emphasizes schema-on-read, and the logic behind it is that business uncertainty is the norm: since we cannot anticipate how the business will change, we keep some flexibility and defer the design, so that the infrastructure as a whole can fit the data to the business on demand. I therefore think that "fidelity" and "flexibility" come from the same idea: since business changes cannot be predicted, simply keep the data in its most original state and process it according to demand when it is needed. For this reason, a data lake is better suited to innovative enterprises and to enterprises whose business changes rapidly. It also places higher demands on its users: data scientists and business analysts (with suitable visualization tools) are the target users of a data lake.
"Manageability": A data lake should provide thorough data management capabilities. Since the data must satisfy "fidelity" and "flexibility", the data lake will hold at least two kinds of data: raw data and processed data. The data in the lake keeps accumulating and evolving, so the requirements on data management are high. At a minimum, the following should be managed: data sources, data connections, data formats, and data schemas (database/table/column/row). In addition, the data lake is the unified data store for a single enterprise or organization, so it also needs some permission management capability.
"Traceability": A data lake stores the full data of an organization or enterprise, so it needs to manage the whole life cycle of that data, including definition, ingestion, storage, processing, analysis, and application. A strong data lake implementation must make the ingestion, storage, processing, and consumption of any piece of data in it traceable, so that the complete generation and flow of the data can be clearly reproduced.
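The difference between fixing the schema at write time and applying it at read time can be illustrated with a minimal sketch. It is only an illustration under assumptions: the order records, the field names, and the two helper functions are hypothetical and do not come from any particular data lake product.

```python
import json

# Hypothetical raw records, kept exactly as the business system emitted them.
RAW_LINES = [
    '{"order_id": 1, "amount": "19.90", "channel": "web"}',
    '{"order_id": 2, "amount": "5.00"}',
    '{"order_id": 3, "amount": "7.25", "coupon": "SPRING"}',
]


def write_with_schema(lines, columns):
    """Schema fixed at write time: keep only the agreed columns when loading."""
    table = []
    for line in lines:
        record = json.loads(line)
        table.append({col: record.get(col) for col in columns})
    return table


def read_with_schema(lines, schema):
    """Schema applied at read time: raw lines stay untouched, each query casts what it needs."""
    for line in lines:
        record = json.loads(line)
        yield {
            name: cast(record[name]) if name in record else None
            for name, cast in schema.items()
        }


# Warehouse-style load: the "coupon" field is dropped because nobody planned for it.
warehouse_table = write_with_schema(RAW_LINES, ["order_id", "amount"])

# Lake-style reads: two different schemas over the same unmodified raw data.
revenue_view = list(read_with_schema(RAW_LINES, {"order_id": int, "amount": float}))
coupon_view = list(read_with_schema(RAW_LINES, {"order_id": int, "coupon": str}))
```

In this sketch a later question about coupons can still be answered from the raw lines, whereas the table loaded with a fixed schema has already lost that field.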
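One way to picture traceability is a lineage log that records, for every dataset, which inputs and which operation produced it. The sketch below is an assumption-laden illustration: `LineageLog`, `LineageRecord`, and the dataset names are hypothetical, and a real implementation would persist this metadata rather than hold it in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageRecord:
    output: str        # dataset that was produced
    inputs: tuple      # datasets it was derived from
    operation: str     # human-readable description of the step


class LineageLog:
    """In-memory lineage store (hypothetical, for illustration only)."""

    def __init__(self):
        self._by_output = {}

    def record(self, output, inputs, operation):
        self._by_output[output] = LineageRecord(output, tuple(inputs), operation)

    def trace(self, dataset):
        """Return the steps that produced `dataset`, upstream steps first."""
        steps, seen = [], set()

        def visit(name):
            rec = self._by_output.get(name)
            if rec is None or name in seen:
                return
            seen.add(name)
            for parent in rec.inputs:
                visit(parent)
            steps.append(rec)

        visit(dataset)
        return steps


log = LineageLog()
log.record("raw/orders", [], "ingest from order system")
log.record("clean/orders", ["raw/orders"], "drop malformed rows")
log.record("report/daily_revenue", ["clean/orders"], "aggregate by day")

history = [step.output for step in log.trace("report/daily_revenue")]
```

Here `history` lists the datasets from the raw source down to the report, which is the "reproduce the generation and flow process" idea in miniature.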
2. Characteristics of computation
Personally, I think a data lake places very broad requirements on computing capability, and which ones apply depends entirely on what computation the business needs.
Rich computing engines. From batch processing and stream computing to interactive analysis and machine learning, all kinds of computing engines fall within the scope of a data lake. In general, data loading, transformation, and processing use a batch computing engine; the parts that need real-time computation can use a stream computing engine; and some exploratory analysis scenarios may call for an interactive analysis engine. As big data technology and artificial intelligence become more closely combined, machine learning and deep learning algorithms are being introduced as well. For example, frameworks such as TensorFlow and PyTorch already support reading training samples from HDFS, S3, or OSS. For a well-built data lake project, therefore, extensible and pluggable computing engines should be a basic capability.
Multimodal storage engine. In theory, the data lake itself should have a built-in multimodal storage engine to meet the data access needs of different applications (taking into account response time, concurrency, access frequency, cost, and so on). In practice, however, the data in a data lake is usually not accessed frequently, and most related applications are exploratory. To achieve an acceptable cost-performance ratio, data lake construction usually chooses relatively cheap storage engines (such as S3, OSS, HDFS, or OBS) and works with external storage engines when needed to meet diverse application requirements.
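The idea of a pluggable computing engine can be sketched as a small registry that maps a workload type to whatever engine handles it. This is a toy under stated assumptions: `register_engine`, `run`, and the engine functions are hypothetical names, and plain Python functions stand in for real batch or streaming engines.

```python
from typing import Callable, Dict

# Registry of engines keyed by workload type (hypothetical, for illustration).
ENGINES: Dict[str, Callable[[list], object]] = {}


def register_engine(workload):
    """Decorator that plugs an engine in under a workload name."""
    def decorator(fn):
        ENGINES[workload] = fn
        return fn
    return decorator


@register_engine("batch")
def batch_total(records):
    # Batch style: see the whole input, return one result.
    return sum(records)


@register_engine("streaming")
def running_totals(records):
    # Streaming style: emit an updated result after each incoming record.
    total, out = 0, []
    for value in records:
        total += value
        out.append(total)
    return out


def run(workload, records):
    """Dispatch to whichever engine is registered for the workload."""
    if workload not in ENGINES:
        raise KeyError(f"no engine registered for workload {workload!r}")
    return ENGINES[workload](records)
```

Adding an interactive or machine learning engine in this sketch means registering one more function; the callers of `run` do not change, which is the point of keeping engines pluggable.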