Data Lake (VII): Iceberg Concepts and a Review of What a Data Lake Is
2022-07-05 13:34:00 【51CTO】
Iceberg concepts and a review of what a data lake is
I. Review: what is a data lake
A data lake is a centralized repository that lets you store structured and unstructured data from multiple sources at any scale. Data can be stored as is, without first having to structure it, and you can then run different types of analytics on it, such as big data processing, real-time analytics, and machine learning, to guide better decisions.
II. Why big data needs a data lake
Offline data warehouses based on Hive are now very mature, but record-level updates in a traditional offline data warehouse are cumbersome: the entire partition that contains the data to be updated, or even the entire table, has to be overwritten. Because an offline data warehouse is designed as a multi-layer architecture in which data is processed layer by layer, an update also has to be propagated layer by layer, from the source-aligned (ODS) layer to every downstream derived table.
As real-time computing engines continue to evolve and business demand for real-time reporting keeps growing, the industry has in recent years focused on exploring how to build real-time data warehouses. In the evolution of data warehouse architecture, the Lambda architecture contains two pipelines, offline processing and real-time processing, as shown below:

[Figure: Lambda architecture]
Because the two pipelines process the same data separately, problems such as data inconsistency arise, which led to the Kappa architecture. The Kappa architecture is shown below:

[Figure: Kappa architecture]
The Kappa architecture can be regarded as a real-time data warehouse. The most common implementation in the industry today is Flink + Kafka. However, a real-time data warehouse based on Kafka + Flink has several obvious drawbacks, so many enterprises use a hybrid architecture when building a real-time data warehouse rather than implementing every business line with Kappa-style real-time processing. The drawbacks of the Kappa architecture are as follows:
- Kafka cannot support massive data storage. For lines of business with very large data volumes, Kafka can generally retain data only for a short time, such as the last week or even just the last day.
- Kafka cannot support efficient OLAP queries. Most businesses want ad hoc queries on the DWD/DWS layers, but Kafka is poorly suited to that need.
- The mature data lineage and data quality management systems built around the offline data warehouse cannot be reused; a new set of data lineage and data quality management systems has to be implemented.
- Kafka does not support update/upsert; at present it supports only append.
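The last limitation can be illustrated with a small sketch. The code below is not Kafka client code; it models an append-only log as a plain Python list, under the assumption that every record carries a key. It shows that an "update" can only be expressed as another appended record, so a reader has to replay the log to obtain the latest value per key. The names `append` and `materialize_latest` are hypothetical and chosen only for illustration.

```python
# A toy append-only log: records are added at the end and never modified in place.
# This is an illustrative model, not the Kafka API.
append_only_log = []


def append(key, value):
    """Append a (key, value) record; existing records are never touched."""
    append_only_log.append((key, value))


def materialize_latest(log):
    """Replay the whole log and keep the last value seen for each key."""
    table = {}
    for key, value in log:
        table[key] = value
    return table


append("order-1", {"status": "created", "amount": 100})
append("order-2", {"status": "created", "amount": 250})
# "Updating" order-1 means appending a second record; the first one stays in the log.
append("order-1", {"status": "paid", "amount": 100})

latest = materialize_latest(append_only_log)
print(len(append_only_log))         # the log holds three records
print(latest["order-1"]["status"])  # the replayed view holds the newest status
```

A storage layer with native upsert support would let the writer replace the row for `order-1` directly, instead of leaving every reader to reconstruct the current state.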
To address these pain points of the Kappa architecture, the mainstream approach in the industry is "unified batch and stream processing". This can be understood as batch and stream jobs sharing the same SQL, or as unifying the processing framework, for example Spark or Flink. More important, however, is unification at the storage layer: once batch and stream are unified at the storage layer, the Kappa problems listed above can be solved. Data lake technology realizes this storage-level unification well, which is why big data needs a data lake.
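As a rough illustration of what unification at the storage layer means, the sketch below keeps a single toy table whose commits are recorded as an ordered list of snapshots. A batch job reads everything in the table, while a streaming job reads only what was committed after the last snapshot it processed. `ToyTable`, `batch_read`, and `incremental_read` are hypothetical names, and the layout is a simplifying assumption, not the metadata format of any real data lake.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """A toy table whose commits are kept as an ordered list of snapshots.

    Each snapshot stores only the rows added by that commit. This is an
    illustrative model, not a real table format.
    """
    snapshots: list = field(default_factory=list)

    def commit(self, rows):
        """Record a new snapshot and return its id (its position in the list)."""
        self.snapshots.append(list(rows))
        return len(self.snapshots) - 1

    def batch_read(self):
        """Batch job: read every row committed so far."""
        return [row for snap in self.snapshots for row in snap]

    def incremental_read(self, after_snapshot_id):
        """Streaming job: read only rows committed after a known snapshot."""
        newer = self.snapshots[after_snapshot_id + 1:]
        return [row for snap in newer for row in snap]


table = ToyTable()
s0 = table.commit([{"id": 1}, {"id": 2}])
s1 = table.commit([{"id": 3}])

print(table.batch_read())          # all rows from both commits
print(table.incremental_read(s0))  # only the rows committed after s0
```

Both readers work against the same stored data, so there is no second copy to keep consistent.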
III. Iceberg concepts and features
1. Concept
Apache Iceberg is an open table format for large-scale analytic workloads. Iceberg provides a high-performance table format that works much like a SQL table, and a single Iceberg table can store tens of petabytes of data. It integrates with compute engines such as Spark, Trino, PrestoDB, Flink, and Hive to provide high-performance reads and writes and metadata management. Iceberg is a data lake solution.
Note: Trino is the former PrestoSQL. On December 27, 2020, the PrestoSQL project was renamed Trino. Presto had split into two branches: PrestoDB and PrestoSQL.
2. Features
Iceberg is very lightweight and can be integrated with Spark and Flink as a library.
Iceberg official website: https://iceberg.apache.org/
Iceberg has the following features:
- Iceberg supports both real-time and batch writes and reads, and works with the Spark and Flink compute engines.
- Iceberg supports ACID transactions, including inserting, deleting, and updating data.
- Iceberg is not bound to any particular underlying storage. It supports the Parquet, ORC, and Avro file formats, covering both row-oriented and column-oriented storage.
- Iceberg supports hidden partitioning and partition evolution, which makes it easier to adjust the partitioning strategy of business data.
- Iceberg supports repeatable queries against snapshot data and can roll back to an earlier version.
- Iceberg scan planning is fast: reading a table or locating its files does not require a distributed SQL engine.
- Iceberg filters queries efficiently by using table metadata.
- Iceberg handles concurrency with optimistic locking, which allows concurrent writes from multiple threads while guaranteeing linearizable consistency of the data.
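Snapshot reads, rollback, and optimistic concurrency from the list above can be sketched together in a few lines. The toy class below is only loosely modeled on the idea of swapping a pointer to the current snapshot; it is not Iceberg's implementation or API, and `ToySnapshotTable`, `CommitConflict`, and `append_with_retry` are hypothetical names. The assumption is that every commit produces a new immutable snapshot and succeeds only if the table has not changed since the writer read it.

```python
import threading


class CommitConflict(Exception):
    """Raised when the table changed after the writer read its base snapshot."""


class ToySnapshotTable:
    """Toy model: every commit produces a new immutable snapshot of all rows."""

    def __init__(self):
        self._snapshots = [()]         # snapshot 0 is the empty table
        self._current = 0              # id of the current snapshot
        self._lock = threading.Lock()  # guards only the short pointer swap

    def current_id(self):
        return self._current

    def read(self, snapshot_id=None):
        """Read the current snapshot, or an older one for a repeatable query."""
        sid = self._current if snapshot_id is None else snapshot_id
        return list(self._snapshots[sid])

    def commit(self, base_id, new_rows):
        """Optimistic commit: succeed only if nobody committed after base_id."""
        with self._lock:
            if base_id != self._current:
                raise CommitConflict(f"base {base_id} != current {self._current}")
            self._snapshots.append(self._snapshots[self._current] + tuple(new_rows))
            self._current = len(self._snapshots) - 1
            return self._current

    def rollback(self, snapshot_id):
        """Point the table back at an earlier snapshot; no data is rewritten."""
        with self._lock:
            self._current = snapshot_id


def append_with_retry(table, rows, max_attempts=3):
    """Re-read the current snapshot id and retry when a conflict is detected."""
    for _ in range(max_attempts):
        try:
            return table.commit(table.current_id(), rows)
        except CommitConflict:
            continue
    raise CommitConflict("gave up after retries")


table = ToySnapshotTable()
base = table.current_id()             # writers A and B both start from snapshot 0
s1 = table.commit(base, ["a"])        # writer A commits first
try:
    table.commit(base, ["b"])         # writer B uses a stale base and is rejected
    conflict = False
except CommitConflict:
    conflict = True
s2 = append_with_retry(table, ["b"])  # writer B retries on top of snapshot 1

print(conflict)        # True
print(table.read())    # ['a', 'b']
print(table.read(s1))  # ['a']
table.rollback(s1)
print(table.read())    # ['a']
```

Writers never block each other while preparing data; only the final pointer swap is serialized, and a writer that loses the race retries against the newer snapshot.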