Data Lake (VII): Iceberg Concepts and a Review of What a Data Lake Is
2022-07-05 13:34:00 【51CTO】
Iceberg concepts and a review of what a data lake is
One. Review: what is a data lake
A data lake is a centralized repository that lets you store all of your structured and unstructured data, from multiple sources and at any scale. Data can be stored as is, without structuring it first, and you can then run different types of analytics on it, such as big data processing, real-time analytics, and machine learning, to guide better decisions.
Two. Why big data needs a data lake
Offline data warehouses based on Hive are now very mature, but record-level updates in a traditional offline warehouse are cumbersome: to change a few records you have to overwrite the entire partition that contains them, or even the whole table. Because an offline warehouse is designed as multiple layers processed one after another, an update also has to be propagated layer by layer, from the source-aligned (ODS) layer down to every derived table.
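The cost of a record-level change under partition overwrite can be sketched with a toy model. This is a minimal illustration and does not use Hive itself: the table is a plain Python dict, and the names `build_table` and `overwrite_partition_update` are hypothetical helpers invented for this example.

```python
from collections import defaultdict


def build_table(rows, partition_key):
    """Toy Hive-style table: {partition_value: [rows]} (an assumption, not Hive's API)."""
    table = defaultdict(list)
    for row in rows:
        table[row[partition_key]].append(row)
    return dict(table)


def overwrite_partition_update(table, partition, row_id, new_values):
    """Update one record by rewriting its whole partition.

    Returns the number of rows that had to be rewritten.
    """
    rewritten = []
    for row in table[partition]:
        if row["id"] == row_id:
            row = {**row, **new_values}
        rewritten.append(row)
    table[partition] = rewritten  # the whole partition is replaced
    return len(rewritten)


# 1000 rows land in the first partition, 500 in the second.
rows = [
    {"id": i, "dt": "2022-07-01" if i < 1000 else "2022-07-02", "amount": 10}
    for i in range(1500)
]
table = build_table(rows, "dt")

# Changing a single record rewrites every row of its partition.
rows_rewritten = overwrite_partition_update(table, "2022-07-01", 42, {"amount": 99})
print(rows_rewritten)  # 1000
```

In a layered warehouse the same rewrite would then have to be repeated for each downstream table derived from this partition.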
As real-time compute engines have matured and business demand for real-time reporting has kept growing, the industry has in recent years focused on building real-time data warehouses. In the evolution of data warehouse architecture, the Lambda architecture contains two pipelines: offline processing and real-time processing. The architecture is shown below:
(Figure: Lambda architecture)
Because the two pipelines process the same data separately, problems such as data inconsistency arise, which led to the Kappa architecture. The Kappa architecture is shown below:
(Figure: Kappa architecture)
The Kappa architecture can be regarded as a real-time data warehouse. The most common implementation in the industry today is Flink + Kafka. However, a real-time warehouse built on Kafka + Flink has several obvious shortcomings, so many enterprises use a hybrid architecture when building a real-time warehouse, and not every business line implements real-time processing the Kappa way. The shortcomings of the Kappa architecture are as follows:
- Kafka cannot support storage of massive data volumes. For business lines with massive amounts of data, Kafka can generally retain data only for a short period, such as the last week or even just the last day.
- Kafka cannot support efficient OLAP queries. Most businesses want to run ad hoc queries on the DWD/DWS layers, but Kafka does not serve this need well.
- The mature data lineage and data quality management systems built around the offline warehouse cannot be reused; a new set of data lineage and data quality management has to be implemented.
- Kafka does not support update/upsert; at present it supports only append.
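What "append only" means for a consumer can be sketched in a few lines. This is a toy model under stated assumptions, not the Kafka API: the log is a Python list, and `append` and `latest_by_key` are hypothetical names chosen for illustration.

```python
# Toy append-only log of (key, value) records; not a Kafka client.
log = []


def append(key, value):
    """The only write the log supports: add a new record at the end."""
    log.append((key, value))


def latest_by_key(records):
    """Readers replay the log to derive the current state per key."""
    state = {}
    for key, value in records:
        state[key] = value  # later records win
    return state


append("order-1", {"status": "created"})
append("order-2", {"status": "created"})
append("order-1", {"status": "paid"})  # an "update" is just another append

current = latest_by_key(log)
print(len(log))            # 3
print(current["order-1"])  # {'status': 'paid'}
```

The old record for `order-1` is still in the log; every reader has to reconstruct the latest state itself, which is the work a table format with upsert support takes over.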
To address the pain points of the Kappa architecture, the mainstream approach in the industry is “batch-stream unification”. This can be understood as batch and stream jobs sharing the same SQL, or as a unified processing framework such as Spark or Flink. More important, however, is unification at the storage layer: once batch and stream share one storage layer, the Kappa problems listed above can be solved. Data lake technology implements “batch-stream unification” at the storage layer well, which is why big data needs a data lake.
Three. Iceberg concepts and features
1、 Concept
Apache Iceberg is an open table format for large-scale data analysis scenarios. Iceberg is a high-performance table format that works much like a SQL table, and a single Iceberg table can hold tens of petabytes of data. It integrates with compute engines such as Spark, Trino, PrestoDB, Flink, and Hive, providing high-performance reads and writes and metadata management. Iceberg is a data lake solution.
Note: Trino is the former PrestoSQL. On December 27, 2020, the PrestoSQL project was renamed Trino. Presto had earlier split into two branches: PrestoDB and PrestoSQL.
2、 Features
Iceberg is very lightweight and can be integrated with Spark and Flink as a library.
Iceberg Official website : https://iceberg.apache.org/
Iceberg has the following features:
- Iceberg supports both real-time and batch reads and writes, and works with the Spark and Flink compute engines.
- Iceberg supports ACID transactions, including inserting, deleting, and updating data.
- It is not bound to any particular underlying storage, and supports the Parquet, ORC, and Avro file formats, covering both row-oriented and columnar storage.
- Iceberg supports hidden partitioning and partition evolution, which makes it easier for the business to adjust its data partitioning strategy.
- Iceberg supports reproducible queries against snapshot data, and supports version rollback.
- Iceberg scan planning is fast: reading a table or locating its files does not require a distributed SQL engine.
- Iceberg filters queries efficiently using table metadata.
- Concurrency control is based on optimistic locking, which allows multi-threaded concurrent writes while guaranteeing linearizable consistency of the data.
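Several of the features above (snapshots, rollback, metadata-based filtering, and optimistic concurrency) can be sketched with a toy in-memory table. This is a simplified illustration under stated assumptions, not Iceberg's implementation or API: `ToyTable`, `DataFile`, `Snapshot`, `CommitConflict`, and the file names are all hypothetical, and a losing writer here simply retries on the latest snapshot.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class DataFile:
    path: str
    min_id: int  # per-file column statistics kept in table metadata
    max_id: int


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: Tuple[DataFile, ...]  # complete file list of this table version


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyTable:
    def __init__(self) -> None:
        self.snapshots: Dict[int, Snapshot] = {0: Snapshot(0, ())}
        self.current_id = 0

    def commit_append(self, base_id: int, new_files: List[DataFile]) -> int:
        """Optimistic commit: succeed only if base_id is still the current snapshot."""
        if base_id != self.current_id:
            raise CommitConflict(f"table moved from {base_id} to {self.current_id}")
        files = self.snapshots[base_id].files + tuple(new_files)
        new_id = len(self.snapshots)  # snapshots are never deleted here, so ids stay unique
        self.snapshots[new_id] = Snapshot(new_id, files)
        self.current_id = new_id
        return new_id

    def rollback(self, snapshot_id: int) -> None:
        """Point the table back at an older snapshot; no data is rewritten."""
        if snapshot_id not in self.snapshots:
            raise KeyError(snapshot_id)
        self.current_id = snapshot_id

    def plan_scan(self, wanted_id: int, snapshot_id: Optional[int] = None) -> List[str]:
        """Pick files whose [min_id, max_id] range can contain wanted_id, using metadata only."""
        snap = self.snapshots[self.current_id if snapshot_id is None else snapshot_id]
        return [f.path for f in snap.files if f.min_id <= wanted_id <= f.max_id]


table = ToyTable()
s1 = table.commit_append(0, [DataFile("a.parquet", 1, 100)])

# Two writers both start from snapshot s1; only the first commit succeeds.
base = table.current_id
s2 = table.commit_append(base, [DataFile("b.parquet", 101, 200)])
try:
    table.commit_append(base, [DataFile("c.parquet", 201, 300)])
    conflict = False
except CommitConflict:
    conflict = True
    # The losing writer retries on top of the new current snapshot.
    s3 = table.commit_append(table.current_id, [DataFile("c.parquet", 201, 300)])

files_now = table.plan_scan(150)                     # only b.parquet can match
files_at_s1 = table.plan_scan(150, snapshot_id=s1)   # b.parquet did not exist yet
table.rollback(s1)
files_after_rollback = table.plan_scan(150)
```

Because every snapshot is an immutable file list, querying an older snapshot or rolling back is only a pointer change, and scan planning needs nothing more than the metadata held in memory.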