Data Lake (VII): Iceberg Concepts and a Review of What a Data Lake Is
2022-07-05 13:34:00 【51CTO】
Iceberg Concepts and a Review of What a Data Lake Is
I. Review: What Is a Data Lake
A data lake is a centralized repository that lets you store data from multiple sources at any scale, whether structured or unstructured. Data can be stored as is, without first having to structure it, and you can then run different types of analysis on it, for example big data processing, real-time analytics, and machine learning, to guide better decisions.
II. Why Big Data Needs a Data Lake
Offline data warehouses based on Hive are now very mature, but record-level updates in a traditional offline data warehouse are cumbersome: the entire partition that contains the data to be updated, or even the whole table, has to be overwritten. In addition, because an offline data warehouse is designed as multiple layers that are processed one after another, an update must also be propagated layer by layer, from the source-aligned (ODS) layer to every downstream derived table.
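To make the cost concrete, here is a minimal sketch in plain Python (not Hive) of a record-level update when the only write primitive is overwriting a whole partition. The table layout, the `overwrite_partition` and `update_record` helpers, and the sample rows are all hypothetical and serve only to illustrate the idea.

```python
from typing import Dict, List

Row = Dict[str, object]
# Hypothetical partitioned table: partition key (a date string) -> rows in that partition.
Table = Dict[str, List[Row]]


def overwrite_partition(table: Table, partition: str, rows: List[Row]) -> int:
    """Replace a whole partition and return the number of rows rewritten."""
    table[partition] = rows
    return len(rows)


def update_record(table: Table, partition: str, order_id: int, new_amount: float) -> int:
    """Update one record by rebuilding and overwriting its entire partition."""
    rebuilt = [
        {**row, "amount": new_amount} if row["order_id"] == order_id else row
        for row in table[partition]
    ]
    return overwrite_partition(table, partition, rebuilt)


# Two partitions with 1000 rows each (made-up sample data).
table: Table = {
    "2022-07-04": [{"order_id": i, "amount": 10.0} for i in range(1000)],
    "2022-07-05": [{"order_id": i, "amount": 20.0} for i in range(1000, 2000)],
}

# Changing a single order forces all 1000 rows of its partition to be rewritten.
rows_rewritten = update_record(table, "2022-07-04", order_id=42, new_amount=99.0)
print(rows_rewritten)
```

In a layered warehouse the same rewrite would then have to be repeated for each derived table that depends on this partition.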
As real-time computing engines have matured and business demand for real-time reports has kept growing, the industry has in recent years been focusing on and exploring how to build real-time data warehouses. In the evolution of data warehouse architecture, the Lambda architecture contains two pipelines: offline processing and real-time processing. The architecture is shown below:
[Figure: Lambda architecture]
Precisely because two separate pipelines process the data, problems such as data inconsistency arise, and this led to the Kappa architecture. The Kappa architecture is shown below:
[Figure: Kappa architecture]
The Kappa architecture can be regarded as a real-time data warehouse. Its most common implementation in the industry today is Flink + Kafka. However, a real-time data warehouse based on Kafka + Flink also has several obvious shortcomings, so many enterprises use a hybrid architecture when building a real-time data warehouse, and not every business line implements real-time processing with the Kappa architecture. The shortcomings of the Kappa architecture are as follows:
- Kafka cannot support storage of massive data volumes. For business lines with massive amounts of data, Kafka can generally retain data only for a very short time, such as the last week or even just the last day.
- Kafka cannot support efficient OLAP queries. Most businesses want to run ad hoc queries on the DWD/DWS layers, but Kafka does not serve this need well.
- The mature data lineage and data quality management systems built around the offline data warehouse cannot be reused. A new set of data lineage and data quality management systems has to be implemented.
- Kafka does not support update/upsert. At present, Kafka supports only append.
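The last point can be illustrated with a toy model. The sketch below is not Kafka client code; `AppendOnlyLog` is a hypothetical stand-in for an append-only topic, used only to show that without update/upsert a reader has to replay every event to learn the current value of a key.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical change event: (key, value); a value of None marks a delete.
Event = Tuple[str, Optional[int]]


class AppendOnlyLog:
    """Toy stand-in for an append-only topic: events can be added, never changed."""

    def __init__(self) -> None:
        self._events: List[Event] = []

    def append(self, key: str, value: Optional[int]) -> None:
        self._events.append((key, value))

    def __len__(self) -> int:
        return len(self._events)

    def replay(self) -> Dict[str, int]:
        """Rebuild the latest value per key by scanning every event in order."""
        state: Dict[str, int] = {}
        for key, value in self._events:
            if value is None:
                state.pop(key, None)   # tombstone: the key no longer exists
            else:
                state[key] = value     # later events win over earlier ones
        return state


log = AppendOnlyLog()
log.append("user-1", 10)
log.append("user-2", 20)
log.append("user-1", 15)    # an "update" is just another appended event
log.append("user-2", None)  # a "delete" is a tombstone event

current = log.replay()
print(len(log), current)
```

All four events stay in the log even though only one key is still live, and the work of collapsing them into the current state falls on whoever reads the log.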
To address the pain points of the Kappa architecture, the mainstream approach in the industry is "batch-stream unification". This can be understood as batch and streaming jobs being handled with the same SQL, or as a unified processing framework such as Spark or Flink. More important, though, is unification at the storage layer: once storage is "batch-stream unified", the Kappa problems listed above can be solved. Data lake technology implements this storage-level "batch-stream unification" well, which is why big data needs a data lake.
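As a rough illustration of what unification at the storage layer means, the following sketch models one table that serves both a batch reader (a full scan as of a snapshot) and a streaming reader (only the rows added between two snapshots). `SnapshotTable` and its methods are hypothetical names for this example and do not correspond to any real data lake API.

```python
from typing import Dict, List


class SnapshotTable:
    """Toy table: every commit adds one immutable snapshot listing its new rows."""

    def __init__(self) -> None:
        self._snapshots: List[List[Dict[str, int]]] = []

    def commit(self, rows: List[Dict[str, int]]) -> int:
        """Append rows as a new snapshot and return its id (1-based)."""
        self._snapshots.append(list(rows))
        return len(self._snapshots)

    def batch_read(self, snapshot_id: int) -> List[Dict[str, int]]:
        """Batch view: all rows visible as of the given snapshot."""
        return [row for snap in self._snapshots[:snapshot_id] for row in snap]

    def incremental_read(self, after_id: int, upto_id: int) -> List[Dict[str, int]]:
        """Streaming view: only rows committed in snapshots after_id+1 .. upto_id."""
        return [row for snap in self._snapshots[after_id:upto_id] for row in snap]


table = SnapshotTable()
s1 = table.commit([{"id": 1}, {"id": 2}])  # e.g. a batch load
s2 = table.commit([{"id": 3}])             # e.g. a small streaming commit
s3 = table.commit([{"id": 4}, {"id": 5}])

full = table.batch_read(s3)                # a batch job scans everything
delta = table.incremental_read(s1, s3)     # a streaming job reads only what is new since s1
print(len(full), [row["id"] for row in delta])
```

Both readers work against the same stored data, so there is no second copy to keep consistent, which is the problem the two Lambda pipelines ran into.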
III. Iceberg Concepts and Features
1. Concept
Apache Iceberg is an open table format for large-scale data analytics. It is a high-performance table format that works like a SQL table, and a single Iceberg table can store tens of petabytes of data. Iceberg integrates with compute engines such as Spark, Trino, PrestoDB, Flink, and Hive, providing them with high-performance reads and writes as well as metadata management. Iceberg is a data lake solution.
Note: Trino is the former PrestoSQL. On December 27, 2020, the PrestoSQL project was renamed Trino. Presto had previously split into two branches: PrestoDB and PrestoSQL.
2. Features
Iceberg is very lightweight and can be integrated with Spark and Flink as a library.
Iceberg official website: https://iceberg.apache.org/
Iceberg has the following features:
- Iceberg supports both real-time and batch writes and reads, and works with the Spark and Flink compute engines.
- Iceberg supports ACID transactions, including inserting, deleting, and updating data.
- Iceberg is not tied to any particular underlying storage. It supports the Parquet, ORC, and Avro file formats, covering both row-oriented and column-oriented storage.
- Iceberg supports hidden partitioning and partition evolution, which makes it easier for the business to adjust its data partitioning strategy.
- Iceberg supports repeatable queries against snapshot data and provides version rollback.
- Scan planning in Iceberg is fast. Reading a table or finding files does not require a distributed SQL engine.
- Iceberg filters queries efficiently using table metadata.
- Concurrency support is based on optimistic locking, which allows multiple concurrent writers while guaranteeing linearizable consistency of the data.
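Two of the features above, optimistic concurrency and version rollback, can be sketched with a small in-memory model. This is a simplified illustration under stated assumptions, not Iceberg's implementation and not its API: `OptimisticTable`, `CommitConflict`, and `append_with_retry` are hypothetical names, and a snapshot is modeled as a plain tuple of rows.

```python
from typing import List, Tuple


class CommitConflict(Exception):
    """Raised when the table changed after the writer read its base snapshot."""


class OptimisticTable:
    """Toy table whose state is a list of immutable snapshots (tuples of rows)."""

    def __init__(self) -> None:
        self._history: List[Tuple[str, ...]] = [()]  # snapshot 0 is the empty table
        self._current = 0                            # index of the current snapshot

    def current(self) -> Tuple[int, Tuple[str, ...]]:
        return self._current, self._history[self._current]

    def commit(self, base_id: int, rows: Tuple[str, ...]) -> int:
        """Compare-and-swap: succeed only if base_id is still the current snapshot."""
        if base_id != self._current:
            raise CommitConflict(f"base {base_id} is stale, current is {self._current}")
        self._history.append(rows)
        self._current = len(self._history) - 1
        return self._current

    def rollback(self, snapshot_id: int) -> None:
        """Point the table back at an earlier snapshot; history is kept."""
        if not 0 <= snapshot_id < len(self._history):
            raise ValueError("unknown snapshot")
        self._current = snapshot_id


def append_with_retry(table: OptimisticTable, row: str, attempts: int = 3) -> int:
    """Re-read the current snapshot and retry when another writer got in first."""
    for _ in range(attempts):
        base_id, rows = table.current()
        try:
            return table.commit(base_id, rows + (row,))
        except CommitConflict:
            continue
    raise CommitConflict("gave up after retries")


table = OptimisticTable()
base_a, rows_a = table.current()           # writer A reads snapshot 0
base_b, rows_b = table.current()           # writer B reads snapshot 0 as well
table.commit(base_a, rows_a + ("a",))      # A commits first and creates snapshot 1

try:
    table.commit(base_b, rows_b + ("b",))  # B's base snapshot is now stale
    conflict = False
except CommitConflict:
    conflict = True

snap_b = append_with_retry(table, "b")     # B retries on top of the latest snapshot
after_commit = table.current()
table.rollback(1)                          # version rollback to snapshot 1
after_rollback = table.current()
print(conflict, after_commit, after_rollback)
```

Writers never take a lock before doing their work; they only check at commit time that nobody else committed in between, so the snapshots form a single linear history. Rollback simply moves the current pointer, which is why earlier snapshots remain queryable.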