Data Lake (VIII): Iceberg data storage format
2022-07-06 20:57:00 【51CTO】
Iceberg data storage format
1. Iceberg terminology
- Data files:
Data files are the files in which an Apache Iceberg table actually stores its data. They usually live in the data directory under the table's storage location. If the file format is Parquet, the file name ends in ".parquet", for example:
00000-0-root_20211212192602_8036d31b-9598-4e30-8e67-ce6c39f034da-job_1639237002345_0025-00001.parquet is a data file.
Each update to an Iceberg table may produce multiple data files.
- Snapshot (table snapshot):
A snapshot represents the state of a table at a point in time. Each snapshot lists all the data files that make up the table at that time. Data files are recorded in manifest files, the manifest files are recorded in a manifest list file, and one manifest list file represents one snapshot.
- Manifest list:
A manifest list is a metadata file that lists the manifest files making up a table snapshot. It stores one manifest file per row. Each row holds the manifest file's path, the partition range of the data files it tracks, the number of data files added, the number of data files deleted, and similar information. Queries can use this information to filter out irrelevant manifests and run faster.
- Manifest file:
A manifest file is also a metadata file. It lists the data files that make up a snapshot. Each row describes one data file in detail: its status, file path, partition information, column-level statistics (for example, the maximum and minimum value of each column and the null count), file size, and number of rows. The column-level statistics let a table scan skip files it does not need.
Manifest files are stored in Avro format and end with the ".avro" suffix, for example: 8138fce4-40f7-41d7-82a5-922274d2abba-m0.avro.
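The relationship between these four terms can be sketched as a small in-memory model. This is only an illustration of the hierarchy described above (snapshot, manifest list, manifest file, data file) and of how column-level statistics allow files to be skipped. The class names, the `plan_scan` function, and all file paths are hypothetical and are not Iceberg's actual API or on-disk schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DataFile:
    """One row of a manifest file: a data file plus its statistics."""
    path: str
    record_count: int
    # column name -> (min value, max value): the column-level statistics
    bounds: Dict[str, Tuple[int, int]]


@dataclass
class ManifestFile:
    """A manifest file lists data files."""
    path: str
    entries: List[DataFile] = field(default_factory=list)


@dataclass
class ManifestList:
    """A manifest list lists manifest files; one manifest list per snapshot."""
    path: str
    manifests: List[ManifestFile] = field(default_factory=list)


@dataclass
class Snapshot:
    """A snapshot points at exactly one manifest list."""
    snapshot_id: int
    manifest_list: ManifestList


def plan_scan(snapshot: Snapshot, column: str, value: int) -> List[str]:
    """Return the data files whose [min, max] range for `column` may contain `value`."""
    selected = []
    for manifest in snapshot.manifest_list.manifests:
        for data_file in manifest.entries:
            low, high = data_file.bounds[column]
            if low <= value <= high:
                selected.append(data_file.path)
    return selected


# Hypothetical table: two manifest files tracking three data files.
m0 = ManifestFile("m0.avro", [
    DataFile("data/a.parquet", 100, {"id": (1, 100)}),
    DataFile("data/b.parquet", 100, {"id": (101, 200)}),
])
m1 = ManifestFile("m1.avro", [
    DataFile("data/c.parquet", 50, {"id": (201, 250)}),
])
s0 = Snapshot(0, ManifestList("snap-0.avro", [m0, m1]))

print(plan_scan(s0, "id", 150))  # ['data/b.parquet']
```

Only `data/b.parquet` has an `id` range that can contain 150, so the other two files never need to be opened. A real engine would apply the same idea one level higher as well, using the partition ranges in the manifest list to skip whole manifest files.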
2. Table format
As a data lake solution, Apache Iceberg is an open table format for large analytic datasets. A table format can be understood as a way of organizing metadata and data files. Iceberg's underlying data storage can connect to file systems such as HDFS and S3, and it supports multiple file formats. It sits below the compute frameworks (Spark, Flink) and above the data files.


This is how Iceberg organizes its underlying files. The figure below shows the Iceberg table format. s0 and s1 are snapshots of the table, each recording the table state after one operation. Every commit generates a snapshot, and every snapshot corresponds to one manifest list metadata file. Each manifest list references multiple manifest metadata files, and each manifest records the addresses of the files produced by the operation, that is, the addresses of the data files.
Because the table is managed through snapshots, Iceberg can read historical versions of a table and perform incremental reads on it. Data files can be stored in different file formats. Parquet, ORC, and Avro are currently supported.
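The snapshot-based reads described above can be illustrated with a toy model in which each snapshot is simply the set of data files visible after a commit. This is a simplified sketch, not Iceberg's implementation: the function names and file paths are hypothetical, and the incremental read here is modeled as a plain set difference between two snapshots.

```python
from typing import FrozenSet, Iterable, List, Tuple

# A history is an ordered list of (snapshot name, visible data files).
History = List[Tuple[str, FrozenSet[str]]]


def commit(history: History, name: str,
           added: Iterable[str], removed: Iterable[str] = ()) -> None:
    """Append a new snapshot derived from the latest one."""
    current = history[-1][1] if history else frozenset()
    history.append((name, (current | frozenset(added)) - frozenset(removed)))


def files_as_of(history: History, name: str) -> FrozenSet[str]:
    """Historical read: the data files visible in the named snapshot."""
    for snapshot_name, files in history:
        if snapshot_name == name:
            return files
    raise KeyError(name)


def incremental_files(history: History, start: str, end: str) -> FrozenSet[str]:
    """Incremental read: files visible in `end` that were not visible in `start`."""
    return files_as_of(history, end) - files_as_of(history, start)


history: History = []
commit(history, "s0", ["data/a.parquet"])  # first commit creates s0
commit(history, "s1", ["data/b.parquet"])  # second commit creates s1

print(sorted(files_as_of(history, "s0")))              # ['data/a.parquet']
print(sorted(files_as_of(history, "s1")))              # ['data/a.parquet', 'data/b.parquet']
print(sorted(incremental_files(history, "s0", "s1")))  # ['data/b.parquet']
```

Earlier snapshots are never modified by later commits, which is what makes it possible to keep reading the table as of s0 after s1 has been written.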


For more detail on how Iceberg organizes table data underneath, see the follow-up articles in this series, where I will explain it in depth.