当前位置：网站首页>Data Lake (VII): Iceberg concept and review what is a data Lake

Data Lake (VII): Iceberg concept and review what is a data Lake

2022-07-05 13:34:00 【51CTO】

Iceberg Concept and review what a data lake is

One 、 Review what a data lake is

Data lake is a centralized repository , Allows you to store multiple sources at any size 、 All structured and unstructured data , Data can be stored as is , No need to process structured data , And run different types of analysis , Process the data , for example ： Big data processing 、 Real time analysis 、 machine learning , To guide better decisions .

Two 、 Why does big data need data lake

Currently based on Hive Our offline data warehouse is very mature , It is very troublesome to update the record level data in the traditional offline data warehouse , The entire partition to which the data to be updated belongs , Even the whole table can be covered completely , Due to the architecture design of multi-level layer by layer processing of off-line data warehouse , When updating data, it also needs to be reflected layer by layer from the paste source layer to the subsequent derived tables .

With the continuous development of real-time computing engine and business demand for real-time report output continues to expand , In recent years, the industry has been focusing on and exploring the construction of real-time warehouse . According to the evolution process of data warehouse architecture , stay Lambda The architecture includes two links: offline processing and real-time processing , The architecture is shown below ：

Data Lake （ 7、 ... and ）：Iceberg Concept and review what a data lake is _kafka edit

It is precisely because the two links process data, resulting in data inconsistency and other problems, so there is Kappa framework ,Kappa The structure is as follows ：

Data Lake （ 7、 ... and ）：Iceberg Concept and review what a data lake is _sql_03 edit

Kappa Architecture can be called real-time data warehouse , At present, the most commonly used implementation in the industry is Flink + Kafka, However, based on Kafka+Flink The real-time data warehouse scheme also has several obvious defects , Therefore, in many enterprises, hybrid architecture is often used in the construction of real-time data warehouse , Not all businesses adopt Kappa Implementation of real-time processing in the architecture .Kappa The architecture defects are as follows ：

Kafka Can't support massive data storage . For lines of business with massive amounts of data ,Kafka Generally, it can only store data for a very short time , Like the last week , Even the last day .
Kafka Can't support efficient OLAP Inquire about , Most businesses want to be in DWD\DWS Layer supports ad hoc queries , however Kafka It's not very friendly to support such a demand .
It is impossible to reuse the mature data consanguinity based on offline data warehouse 、 Data quality management system . We need to re implement a set of data consanguinity 、 Data quality management system .
Kafka I won't support it update/upsert, at present Kafka Support only append.

In order to solve Kappa The pain point of Architecture , The mainstream in the industry is to adopt “ Batch flow integration ” The way , Here, the integration of batch and stream can be understood as the use of batch and stream SQL Same processing , It can also be understood as the unification of processing framework , for example ：Spark、Flink, But what's more important here is the unity on the storage layer , As long as you do it at the storage level “ Batch flow integration ” Can solve the above Kappa All kinds of problems . Data Lake technology can well realize the storage level “ Batch flow integration ”, This is why data lake is needed in big data .

3、 ... and 、Iceberg Concept and characteristics

1、 Concept

Apache Iceberg It is an open table format for large-scale data analysis scenarios （Table Format）.Iceberg Use a method similar to SQL High performance table format for tables ,Iceberg The format form table can store tens of PB data , adapter Spark、Trino、PrestoDB、Flink and Hive And other computing engines provide high-performance read-write and metadata management functions ,Iceberg Is a data Lake solution .

Be careful ：Trino It's the original PrestoSQL ,2020 year 12 month 27 Japan ,PrestoSQL The project was renamed Trino,Presto Split into two branches ：PrestoDB、PrestorSQL.

2、 characteristic

Iceberg Very lightweight , It can be used as lib And Spark、Flink To integrate

Iceberg Official website ： https://iceberg.apache.org/

Iceberg It has the following characteristics ：

Iceberg Support real-time / Batch data writing and reading , Support Spark/Flink Calculation engine .
Iceberg Support transactions ACID, Support adding 、 Delete 、 Update data .
Do not bind any underlying storage , Support Parquet、ORC、Avro The format is compatible with row storage and column storage .
Iceberg Support hidden partition and partition change , Facilitate business data partition strategy .
Iceberg Support repeated query of snapshot data , With version rollback function .
Iceberg The scan plan is fast , Reading tables or querying files can be done without the need for distributed SQL engine .
Iceberg Efficiently filter queries through table metadata .
Concurrency support based on optimistic locking , Provide multi-threaded concurrent writing capability and ensure linear consistency of data .

原网站

版权声明
本文为[51CTO]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/186/202207051233059911.html