Data Lake (VIII): Iceberg data storage format
2022-07-06 20:57:00 【51CTO】
Iceberg data storage format
1. Iceberg terminology
- Data files:
Data files are the files in which an Apache Iceberg table actually stores its data. They usually sit in the data directory under the table's storage location. If the file format is Parquet, the file names end in ".parquet", for example:
00000-0-root_20211212192602_8036d31b-9598-4e30-8e67-ce6c39f034da-job_1639237002345_0025-00001.parquet is a data file.
Each update to an Iceberg table can produce multiple data files.
- Snapshot (table snapshot):
A snapshot represents the state of a table at a point in time. Each snapshot lists all the data files that make up the table at that moment. Data files are recorded in manifest files, manifest files are recorded in a manifest list file, and one manifest list file represents one snapshot.
- Manifest list:
A manifest list is a metadata file that lists the manifest files making up a table snapshot. Each manifest file occupies one row. Each row stores the path of the manifest file, the partition range of the data files it tracks, how many data files were added, how many were deleted, and other information. Queries can use this information to filter out manifests and speed up execution.
- Manifest file:
A manifest file is also a metadata file. It lists the data files that make up a snapshot. Each row describes one data file in detail, including its status, file path, partition information, column-level statistics (for example the minimum and maximum value of each column and the null count), the file size, and the number of rows in the file. The column-level statistics let a table scan skip files it does not need.
Manifest files are stored in Avro format and end with the ".avro" suffix, for example: 8138fce4-40f7-41d7-82a5-922274d2abba-m0.avro.
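The two levels of filtering described above can be sketched in a few lines of Python. This is only an illustrative model, not Iceberg's real metadata schema or API: the class names, the fields `partition_min`, `partition_max`, `lower_bounds`, `upper_bounds`, and the function `plan_scan` are all hypothetical names chosen for the example, and the sample paths are made up.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    """One row of a manifest file: a data file plus its column-level stats."""
    path: str
    record_count: int
    lower_bounds: dict  # column name -> minimum value in this file
    upper_bounds: dict  # column name -> maximum value in this file


@dataclass
class ManifestFile:
    """One row of a manifest list: a manifest plus its partition range."""
    path: str
    partition_min: str  # hypothetical simplification of the partition range
    partition_max: str
    data_files: list = field(default_factory=list)


def plan_scan(manifest_list, partition, column, value):
    """Return paths of data files that may hold rows where column == value."""
    selected = []
    for manifest in manifest_list:
        # Level 1: skip whole manifests whose partition range excludes the query.
        if not (manifest.partition_min <= partition <= manifest.partition_max):
            continue
        for data_file in manifest.data_files:
            # Level 2: skip data files whose min/max stats exclude the value.
            low = data_file.lower_bounds[column]
            high = data_file.upper_bounds[column]
            if low <= value <= high:
                selected.append(data_file.path)
    return selected


# Made-up sample metadata: two manifests tracking three data files.
manifest_list = [
    ManifestFile("m0.avro", "2021-12-01", "2021-12-10", [
        DataFile("data/a.parquet", 100, {"id": 1}, {"id": 100}),
        DataFile("data/b.parquet", 100, {"id": 101}, {"id": 200}),
    ]),
    ManifestFile("m1.avro", "2021-12-11", "2021-12-20", [
        DataFile("data/c.parquet", 50, {"id": 201}, {"id": 250}),
    ]),
]

print(plan_scan(manifest_list, "2021-12-05", "id", 150))
```

In this sketch the second manifest is never opened for the sample query, and only one of the two data files in the first manifest survives the min/max check.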
2. Table format
As a data lake solution, Apache Iceberg is an open table format for large analytic datasets. A table format can be understood as a way of organizing metadata and data files. Iceberg's underlying storage can be HDFS or an S3 file system, and it supports a variety of file formats. It sits below the compute frameworks (Spark, Flink) and above the data files.
Here is how Iceberg organizes its underlying files. In the diagram of the Iceberg table format, s0 and s1 are snapshots of the table, each representing the table state after one operation. Every commit generates a snapshot. Each snapshot corresponds to one manifest list metadata file, and each manifest list contains multiple manifest metadata files. A manifest records the addresses of the files that hold the data generated by the current operation, that is, the addresses of the data files.
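The chain from commit to snapshot to manifest list to manifest to data file can be illustrated with a toy append-only table. This is a sketch under stated assumptions rather than Iceberg's actual implementation: `ToyTable`, `Manifest`, `Snapshot`, and `files_in` are hypothetical names, the file paths are invented, and the sketch assumes that a new manifest list carries over the earlier manifests and adds one new manifest per commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Manifest:
    path: str
    data_files: tuple  # addresses of the data files written by one commit


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    manifest_list: tuple  # the manifests that make up this snapshot


class ToyTable:
    """Append-only toy table: every commit adds one manifest and one snapshot."""

    def __init__(self):
        self.snapshots = []

    def commit(self, new_data_files):
        previous = self.snapshots[-1].manifest_list if self.snapshots else ()
        snapshot_id = len(self.snapshots)
        manifest = Manifest(f"m{snapshot_id}.avro", tuple(new_data_files))
        # Assumption: the new manifest list reuses earlier manifests
        # and appends the manifest written by this commit.
        snapshot = Snapshot(snapshot_id, previous + (manifest,))
        self.snapshots.append(snapshot)
        return snapshot


def files_in(snapshot):
    """Walk snapshot -> manifest list -> manifests -> data file addresses."""
    return [f for m in snapshot.manifest_list for f in m.data_files]


table = ToyTable()
s0 = table.commit(["data/00000.parquet"])
s1 = table.commit(["data/00001.parquet", "data/00002.parquet"])
print(files_in(s0))
print(files_in(s1))
```

Here s0 sees only the file written by the first commit, while s1 sees all three files, because its manifest list points at both manifests.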
Because tables are managed through snapshots, Iceberg can read historical versions of a table and perform incremental reads on a table. Data files can be stored in different file formats; Parquet, ORC, and Avro are currently supported.
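Both capabilities follow from keeping the list of snapshots. The sketch below models them in plain Python; it does not use Iceberg's API, and `Snapshot`, `snapshot_as_of`, `incremental_files`, the timestamps, and the file names are hypothetical. It also assumes an append-only history, so an incremental read reduces to the files present in the newer snapshot but not in the older one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    data_files: frozenset  # all data files visible in this snapshot


def snapshot_as_of(snapshots, timestamp_ms):
    """Historical read: latest snapshot committed at or before timestamp_ms."""
    candidates = [s for s in snapshots if s.timestamp_ms <= timestamp_ms]
    if not candidates:
        raise ValueError("no snapshot exists at or before that time")
    return max(candidates, key=lambda s: s.timestamp_ms)


def incremental_files(older, newer):
    """Incremental read: data files added between two snapshots."""
    return sorted(newer.data_files - older.data_files)


# Made-up snapshot history for illustration.
history = [
    Snapshot(0, 1000, frozenset({"a.parquet"})),
    Snapshot(1, 2000, frozenset({"a.parquet", "b.parquet"})),
    Snapshot(2, 3000, frozenset({"a.parquet", "b.parquet", "c.parquet"})),
]

print(snapshot_as_of(history, 2500).snapshot_id)
print(incremental_files(history[0], history[2]))
```

A reader asking for the table as of time 2500 is served snapshot 1, and a reader that has already consumed snapshot 0 only needs to scan the two files added since then.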
Later articles in this series will explain the underlying organization of Iceberg table data in detail.