002: What are the characteristics of a data lake?
2022-06-12 09:36:00 【YoungerChina】
I think the characteristics of a data lake can be analyzed further at two levels: data and computation.
1. Characteristics of data
“Fidelity”: The data lake stores a complete, identical copy of the data held in the business systems. Unlike a data warehouse, a data lake must keep a copy of the original data: its format, its schema, and its content should not be modified. In this respect, the data lake emphasizes preserving business data in its original form. The data lake should also be able to store data of any type or format.
“Flexibility”: One item in the table above is “schema-on-write” vs. “schema-on-read”, which is essentially a question of the stage at which the data schema is designed. Every data application needs schema design at some point. Even for so-called “schemaless” databases such as MongoDB, best practice still recommends that records share the same or a similar structure wherever possible. The logic behind schema-on-write is that, before data is written, its schema must be determined from the way the business will access it, and the data is then imported according to that schema. The benefit is a good fit between data and business. The drawback is a higher upfront cost of ownership: when the business model is unclear and the business is still exploring, a data warehouse is not flexible enough. The data lake emphasizes schema-on-read, and the logic behind it is that business uncertainty is the norm. Since we cannot anticipate how the business will change, we keep some flexibility and defer the design, so that the infrastructure as a whole can fit the data to the business on demand. I therefore think “fidelity” and “flexibility” come from the same idea: since business changes cannot be predicted, simply keep the data in its most original state and process it as needed when the need arises. For this reason, data lakes are better suited to innovative enterprises and to enterprises whose business is changing rapidly. A data lake also places higher demands on its users: data scientists and business analysts (equipped with suitable visualization tools) are its target users.
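The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample events, the `WRITE_SCHEMA` mapping, and both function names are hypothetical and do not come from any particular data lake product.

```python
import json

# Hypothetical raw events, kept exactly as the source system emitted them.
RAW_EVENTS = [
    '{"user": "u1", "amount": "12.50", "channel": "web"}',
    '{"user": "u2", "amount": "3", "coupon": "SPRING"}',
]

# Schema-on-write: the schema is fixed before loading.
WRITE_SCHEMA = {"user": str, "amount": float, "channel": str}


def load_with_write_schema(lines):
    """Accept only records whose fields match WRITE_SCHEMA; reject the rest."""
    accepted, rejected = [], []
    for line in lines:
        record = json.loads(line)
        if set(record) != set(WRITE_SCHEMA):
            rejected.append(line)
            continue
        accepted.append({k: cast(record[k]) for k, cast in WRITE_SCHEMA.items()})
    return accepted, rejected


def read_with_schema(lines, schema):
    """Schema-on-read: raw lines stay untouched; each reader projects the
    fields it needs at query time, and missing fields become None."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append(
            {k: (cast(record[k]) if k in record else None) for k, cast in schema.items()}
        )
    return rows
```

With schema-on-write, the second event is rejected at load time because it does not match the predefined schema. With schema-on-read, a later question such as "which users used a coupon?" can still be answered by calling `read_with_schema(RAW_EVENTS, {"user": str, "coupon": str})`, because the raw record was never discarded.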
“Manageability”: A data lake should provide comprehensive data management capabilities. Because the data must be both faithful and flexible, the lake holds at least two kinds of data: raw data and processed data. The data in the lake keeps accumulating and evolving, so the demands on data management are high. At a minimum, it should cover data sources, data connections, data formats, and data schemas (database/table/column/row). In addition, because the data lake is the unified data store of a single enterprise or organization, it also needs a degree of permission management.
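As a rough sketch of what such management metadata might look like, the following Python snippet models a tiny catalog that records source, connection, format, and schema for each dataset, and checks read permission before returning an entry. The class and field names are hypothetical and are not taken from any real catalog service.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    name: str
    source: str        # originating business system
    connection: str    # how the lake reaches that system
    data_format: str   # e.g. "json", "parquet"
    schema: dict       # column name -> type name
    kind: str = "raw"  # "raw" or "processed"
    readers: set = field(default_factory=set)


class Catalog:
    """Minimal in-memory catalog with a per-dataset read permission check."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def grant_read(self, name, principal):
        self._entries[name].readers.add(principal)

    def describe(self, name, principal):
        entry = self._entries[name]
        if principal not in entry.readers:
            raise PermissionError(f"{principal} may not read {name}")
        return entry
```

A real deployment would persist this metadata and use a finer-grained permission model (for example, column- or row-level rules). The sketch only shows that source, connection, format, schema, and access rights are all tracked for each dataset.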
“Traceability”: A data lake is where the full set of an organization's or enterprise's data is stored, so it needs to manage the whole data life cycle, covering the definition, ingestion, storage, processing, analysis, and application of data. A strong data lake implementation must make the ingestion, storage, processing, and consumption of any piece of data in it traceable, so that the complete process by which the data was produced and moved can be clearly reconstructed.
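One common way to make data traceable is to record, for every derived dataset, which inputs and which operation produced it, and then walk those records backwards. The sketch below assumes that approach. The `LineageLog` class and the dataset names in the usage example are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageEdge:
    output: str       # dataset that was produced
    inputs: tuple     # datasets it was derived from
    operation: str    # human-readable description of the step


class LineageLog:
    """Records how each dataset was produced and reconstructs its history."""

    def __init__(self):
        self._edges = {}

    def record(self, output, inputs, operation):
        self._edges[output] = LineageEdge(output, tuple(inputs), operation)

    def trace(self, dataset):
        """Return the steps that led to `dataset`, upstream steps first."""
        steps, seen = [], set()

        def visit(name):
            # Raw datasets have no recorded edge; `seen` guards against cycles.
            if name in seen or name not in self._edges:
                return
            seen.add(name)
            edge = self._edges[name]
            for parent in edge.inputs:
                visit(parent)
            steps.append(edge)

        visit(dataset)
        return steps
```

For example, after `log.record("orders_clean", ["orders_raw"], "drop duplicates")` and `log.record("daily_revenue", ["orders_clean"], "group by day")`, calling `log.trace("daily_revenue")` returns both steps in the order they were applied, which is the reconstruction of the generation process described above.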
2. Computational features
Personally, I think a data lake places very broad demands on computing capability; what exactly is needed depends entirely on the computing requirements of the business.
Rich compute engines. Batch processing, stream computing, interactive analysis, and machine learning engines all fall within the scope of a data lake. In general, data loading, transformation, and processing use a batch engine; the parts that need real-time results can use a stream computing engine; and some exploratory analysis scenarios may call for an interactive analysis engine. As big data technology and artificial intelligence become more closely combined, machine learning and deep learning algorithms are also being brought in. For example, the TensorFlow and PyTorch frameworks already support reading sample data from HDFS/S3/OSS for training. For a qualified data lake project, therefore, extensible and pluggable compute engines should be a basic capability.
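"Pluggable" can be read as: new engines are registered behind a common interface without changing the code that dispatches work to them. The following Python sketch illustrates that idea only. The `EngineRegistry` class and the two toy engines are hypothetical stand-ins, not real batch or interactive engines.

```python
from typing import Callable, Dict, List

# An "engine" here is any callable that takes records and returns result rows.
Engine = Callable[[List[dict]], List[dict]]


class EngineRegistry:
    """Maps a workload type (e.g. "batch", "interactive") to an engine."""

    def __init__(self):
        self._engines: Dict[str, Engine] = {}

    def register(self, workload: str, engine: Engine) -> None:
        self._engines[workload] = engine

    def run(self, workload: str, records: List[dict]) -> List[dict]:
        if workload not in self._engines:
            raise KeyError(f"no engine registered for workload {workload!r}")
        return self._engines[workload](records)


def batch_total(records: List[dict]) -> List[dict]:
    # Toy stand-in for a batch job: aggregate over the whole dataset.
    return [{"total": sum(r["amount"] for r in records)}]


def interactive_filter(records: List[dict]) -> List[dict]:
    # Toy stand-in for an ad hoc query: filter rows on demand.
    return [r for r in records if r["amount"] > 10]
```

Adding a streaming or machine learning engine would then be one more `register` call, with no changes to `run` or to its callers.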
Multimodal storage engines. In theory, the data lake itself should have built-in multimodal storage engines to meet the data access needs of different applications (taking into account response time, concurrency, access frequency, cost, and so on). In practice, however, data in a data lake is usually not accessed at high frequency, and most of the related applications are exploratory. To achieve an acceptable price/performance ratio, data lake construction therefore usually chooses relatively inexpensive storage engines (such as S3/OSS/HDFS/OBS) and works with external storage engines when needed to meet diverse application requirements.