Data Lake (VII): Iceberg concepts and a review of what a data lake is
2022-07-05 13:34:00 【51CTO】
Iceberg concepts and a review of what a data lake is
I. Review: what is a data lake?
A data lake is a centralized repository that lets you store all structured and unstructured data from multiple sources at any scale. Data can be stored as is, without first having to structure it, and you can run different types of analytics on it, such as big data processing, real-time analytics, and machine learning, to guide better decisions.
II. Why does big data need a data lake?
Offline data warehouses based on Hive are now very mature, but updating record-level data in a traditional offline data warehouse is cumbersome: the entire partition containing the records to be updated, or even the whole table, has to be overwritten. Because an offline data warehouse is designed as multiple layers processed one after another, every update must also be propagated layer by layer, from the source-aligned (ODS) layer down to all derived tables.
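To make the cost of a record-level update concrete, here is a minimal Python sketch of a partition overwrite. It is a toy in-memory model rather than Hive's actual API: the `Table` structure, the `update_record` helper, and the sample rows are all hypothetical. It only illustrates that changing one record means rewriting every row in its partition.

```python
from typing import Dict, List

# Toy model of a partitioned table: partition key -> list of rows.
# (Hypothetical structure for illustration; not how Hive stores data.)
Row = Dict[str, object]
Table = Dict[str, List[Row]]


def update_record(table: Table, partition: str, record_id: int, amount: int) -> int:
    """Update one record by rewriting its whole partition.

    Returns the number of rows that had to be rewritten.
    """
    old_rows = table[partition]
    # Partition overwrite: every row is copied, even though only one changes.
    new_rows = [
        {**row, "amount": amount} if row["id"] == record_id else dict(row)
        for row in old_rows
    ]
    table[partition] = new_rows
    return len(new_rows)


table: Table = {
    "dt=2022-07-04": [{"id": i, "amount": 10} for i in range(1000)],
    "dt=2022-07-05": [{"id": i, "amount": 10} for i in range(1000, 1500)],
}

# Changing a single record rewrites all 1000 rows of its partition.
rewritten = update_record(table, "dt=2022-07-04", record_id=42, amount=99)
print(rewritten)  # 1000
```

In a layered warehouse the same rewrite would then have to be repeated for each derived table that depends on this partition.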
As real-time computing engines have matured and business demand for real-time reports has kept growing, the industry has in recent years focused on building real-time data warehouses. In the evolution of data warehouse architectures, the Lambda architecture contains two pipelines: offline processing and real-time processing. The architecture is shown below:
[Figure: Lambda architecture]
Because two separate pipelines process the data, problems such as data inconsistency arise, and this is what led to the Kappa architecture. The Kappa architecture is shown below:
[Figure: Kappa architecture]
The Kappa architecture can be regarded as a real-time data warehouse. Its most common implementation in the industry today is Flink + Kafka. However, a real-time data warehouse built on Kafka + Flink has several obvious shortcomings, so many enterprises use a hybrid architecture when building a real-time data warehouse, and not every business line uses the Kappa architecture for real-time processing. The shortcomings of the Kappa architecture are as follows:
- Kafka cannot store massive amounts of data. For business lines with very large data volumes, Kafka can usually retain data only for a short time, such as the last week or even just the last day.
- Kafka cannot support efficient OLAP queries. Most businesses want to run ad hoc queries on the DWD/DWS layers, but Kafka does not serve this need well.
- The mature data lineage and data quality management systems built for the offline data warehouse cannot be reused. A new data lineage and data quality management system has to be implemented.
- Kafka does not support update/upsert. At present Kafka supports only append.
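The append-only limitation in the last point can be sketched in a few lines of Python. This is a toy in-memory log, not the Kafka client API: the `Event` type, the `append` and `materialize` helpers, and the convention that a `None` value marks a delete are assumptions made for illustration. It shows that an "update" can only be expressed as another appended record, so a reader has to replay the log to find the latest value for a key.

```python
from typing import Dict, List, Optional, Tuple

# An append-only log: records are only ever added at the end.
# (key, value); a value of None is treated as a delete marker in this sketch.
Event = Tuple[str, Optional[int]]


def append(log: List[Event], key: str, value: Optional[int]) -> None:
    """The only write operation available: add a record at the end."""
    log.append((key, value))


def materialize(log: List[Event]) -> Dict[str, int]:
    """Replay the whole log to obtain the latest value per key."""
    state: Dict[str, int] = {}
    for key, value in log:
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state


log: List[Event] = []
append(log, "order-1", 100)
append(log, "order-2", 250)
append(log, "order-1", 120)   # an "update" is just another appended record
append(log, "order-2", None)  # a "delete" is an appended marker

print(len(log))          # 4
print(materialize(log))  # {'order-1': 120}
```

All four records stay in the log even though only one key is still live, which is why record-level updates are awkward on storage that can only append.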
To address these pain points of the Kappa architecture, the mainstream approach in the industry is "unified batch and stream processing". This can be understood as processing batch and streaming data with the same SQL, or as unifying the processing framework, for example with Spark or Flink. More important, however, is unification at the storage layer: once batch and stream are unified at the storage layer, the Kappa problems listed above can be solved. Data lake technology achieves exactly this storage-level unification, which is why big data needs a data lake.
III. Iceberg concepts and features
1. Concept
Apache Iceberg is an open table format for large-scale data analytics. Iceberg is a high-performance table format that works much like a SQL table. A single Iceberg table can store tens of PB of data, and Iceberg integrates with compute engines such as Spark, Trino, PrestoDB, Flink, and Hive to provide high-performance reads and writes as well as metadata management. Iceberg is a data lake solution.
Note: Trino is the former PrestoSQL. On December 27, 2020, the PrestoSQL project was renamed Trino. Presto had split into two branches: PrestoDB and PrestoSQL.
2. Features
Iceberg is very lightweight and can be integrated with Spark and Flink as a library.
Iceberg official website: https://iceberg.apache.org/
Iceberg has the following features:
- Iceberg supports both real-time and batch data writes and reads, and works with the Spark and Flink compute engines.
- Iceberg supports ACID transactions, including inserting, deleting, and updating data.
- Iceberg is not tied to any particular underlying storage, and it supports the Parquet, ORC, and Avro file formats, covering both row-oriented and columnar storage.
- Iceberg supports hidden partitioning and partition evolution, which makes it easy for the business to change its data partitioning strategy.
- Iceberg supports reproducible queries against snapshot data and provides version rollback.
- Iceberg scan planning is fast: reading a table or locating its files does not require a distributed SQL engine.
- Iceberg filters queries efficiently by using table metadata.
- Iceberg supports concurrency through optimistic locking, allowing multiple concurrent writers while guaranteeing linearizable consistency of the data.
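The snapshot and optimistic-locking features in the list above can be illustrated with a small Python sketch. It is a toy in-memory model and not Iceberg's API: `ToyTable`, `commit_append`, `rollback_to`, `append_with_retry`, and the file names are all hypothetical, and a `threading.Lock` stands in for the atomic swap of the table's current-metadata pointer. Each commit creates a new immutable snapshot, and a commit succeeds only if the snapshot the writer started from is still the current one. Otherwise the writer re-reads the table and retries.

```python
import threading
from typing import Dict, List, Tuple


class CommitConflict(Exception):
    """Raised when the table changed after the writer read it."""


class ToyTable:
    """Toy snapshot-based table (hypothetical; not Iceberg's API)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # stands in for an atomic pointer swap
        self._snapshots: Dict[int, Tuple[str, ...]] = {0: ()}
        self._current = 0
        self._next_id = 1

    def current_snapshot_id(self) -> int:
        return self._current

    def scan(self, snapshot_id: int) -> Tuple[str, ...]:
        """Read the file list of any retained snapshot (repeatable reads)."""
        return self._snapshots[snapshot_id]

    def commit_append(self, base_id: int, new_files: List[str]) -> int:
        """Commit only if base_id is still current (optimistic check)."""
        with self._lock:
            if base_id != self._current:
                raise CommitConflict(f"base {base_id} != current {self._current}")
            snapshot_id = self._next_id
            self._next_id += 1
            self._snapshots[snapshot_id] = self._snapshots[base_id] + tuple(new_files)
            self._current = snapshot_id
            return snapshot_id

    def rollback_to(self, snapshot_id: int) -> None:
        """Point the table back at an earlier snapshot."""
        with self._lock:
            if snapshot_id not in self._snapshots:
                raise KeyError(snapshot_id)
            self._current = snapshot_id


def append_with_retry(table: ToyTable, new_files: List[str], max_retries: int = 5) -> int:
    """Re-read the current snapshot and retry when a commit conflicts."""
    for _ in range(max_retries):
        base = table.current_snapshot_id()
        try:
            return table.commit_append(base, new_files)
        except CommitConflict:
            continue
    raise CommitConflict("gave up after retries")


table = ToyTable()
base = table.current_snapshot_id()              # both writers read snapshot 0
s1 = table.commit_append(base, ["a.parquet"])   # writer A commits first
try:
    table.commit_append(base, ["b.parquet"])    # writer B's base is now stale
    conflict = False
except CommitConflict:
    conflict = True
s2 = append_with_retry(table, ["b.parquet"])    # writer B retries on the new base

print(conflict)         # True
print(table.scan(s1))   # ('a.parquet',)
print(table.scan(s2))   # ('a.parquet', 'b.parquet')
table.rollback_to(s1)   # version rollback: s1 becomes current again
```

Because old snapshots are never modified, a query that pins a snapshot ID keeps returning the same result, and rollback only needs to move the current pointer.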