Data Lake (VIII): Iceberg data storage format
2022-07-06 20:57:00 【51CTO】
Iceberg data storage format
1. Iceberg terminology
- Data files:
Data files are the files in which an Apache Iceberg table actually stores its data. They are usually located in the "data" directory under the table's storage directory. If the file format is Parquet, the file name ends with ".parquet", for example:
00000-0-root_20211212192602_8036d31b-9598-4e30-8e67-ce6c39f034da-job_1639237002345_0025-00001.parquet is a data file.
Each update to an Iceberg table can produce multiple data files.
- Snapshot (table snapshot):
A snapshot represents the state of a table at a point in time. Each snapshot lists all the data files that make up the table at that moment. Data files are recorded in manifest files, manifest files are recorded in a manifest list file, and one manifest list file represents one snapshot.
- Manifest list:
A manifest list is a metadata file that lists the manifest files making up a table snapshot. It stores one row per manifest file. Each row records the manifest file's path, the partition range of the data files it tracks, how many data files were added, how many were deleted, and other information. A query can use this information to filter out manifest files it does not need, which speeds up the query.
- Manifest file:
A manifest file is also a metadata file. It lists the data files that make up a snapshot. Each row describes one data file in detail, including its status, file path, partition information, column-level statistics (for example, the maximum and minimum value and the null count of each column), file size, and the number of rows in the file. The column-level statistics let a table scan skip files it does not need.
Manifest files are stored in Avro format and end with the ".avro" suffix, for example: 8138fce4-40f7-41d7-82a5-922274d2abba-m0.avro.
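The two-level filtering described above (partition ranges in the manifest list, column statistics in the manifest file) can be illustrated with a small in-memory model. This is only a sketch: the class names, field names, and the plan_scan function are hypothetical and are not the Iceberg API, and real manifests are Avro files rather than Python objects.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DataFile:
    path: str
    record_count: int
    # Column-level statistics: column name -> (min value, max value).
    bounds: Dict[str, Tuple[int, int]]


@dataclass
class ManifestFile:
    path: str
    # Partition range covered by the data files this manifest tracks.
    partition_min: str
    partition_max: str
    data_files: List[DataFile] = field(default_factory=list)


@dataclass
class ManifestList:
    # One manifest list corresponds to one snapshot.
    manifests: List[ManifestFile]


def plan_scan(manifest_list: ManifestList, partition: str,
              column: str, value: int) -> List[str]:
    """Return paths of data files that may hold rows where column == value."""
    selected = []
    for manifest in manifest_list.manifests:
        # Level 1: skip a whole manifest using its partition range.
        if not (manifest.partition_min <= partition <= manifest.partition_max):
            continue
        for data_file in manifest.data_files:
            # Level 2: skip a data file using its column min/max statistics.
            low, high = data_file.bounds[column]
            if low <= value <= high:
                selected.append(data_file.path)
    return selected


# Hypothetical snapshot with two manifests and three data files.
snapshot = ManifestList(manifests=[
    ManifestFile("m0.avro", "2021-12-01", "2021-12-10", [
        DataFile("data/a.parquet", 100, {"id": (1, 100)}),
        DataFile("data/b.parquet", 50, {"id": (101, 150)}),
    ]),
    ManifestFile("m1.avro", "2021-12-11", "2021-12-20", [
        DataFile("data/c.parquet", 80, {"id": (1, 80)}),
    ]),
])

print(plan_scan(snapshot, "2021-12-05", "id", 120))
```

In this example only data/b.parquet is selected: m1.avro is skipped by its partition range without being opened, and data/a.parquet is skipped because 120 falls outside its id range.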
2. Table format
As a data lake solution, Apache Iceberg is an open table format for large analytic datasets. A table format can be understood as a way of organizing metadata and data files. Iceberg's underlying storage can connect to file systems such as HDFS and S3, and it supports multiple file formats. It sits below the compute engines (Spark, Flink) and above the data files.
The following describes how Iceberg organizes its underlying files. In the figure of the Iceberg table format, s0 and s1 are snapshots of the table, each representing the table state after one operation. Every commit generates a snapshot, every snapshot corresponds to one manifest list metadata file, and every manifest list references one or more manifest metadata files. A manifest file records the addresses of the files that hold the data produced by the operation, that is, the addresses of the data files.
Because tables are managed through snapshots, Iceberg can read historical versions of a table and perform incremental reads on it. Data files can be stored in different file formats; Parquet, ORC, and Avro are currently supported.
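As a rough illustration of why snapshot-based management makes historical and incremental reads possible, the toy model below keeps one immutable set of data file paths per commit. The Table and Snapshot classes and their methods are hypothetical names invented for this sketch; Iceberg itself tracks snapshots in metadata files rather than in memory.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    timestamp_ms: int
    # Every data file visible in the table at this snapshot.
    data_files: FrozenSet[str]


class Table:
    def __init__(self) -> None:
        self.snapshots: List[Snapshot] = []

    def commit(self, timestamp_ms: int, added: Iterable[str] = (),
               deleted: Iterable[str] = ()) -> Snapshot:
        """Each commit creates a new snapshot; older snapshots stay unchanged."""
        current = self.snapshots[-1].data_files if self.snapshots else frozenset()
        files = (current - frozenset(deleted)) | frozenset(added)
        snapshot = Snapshot(len(self.snapshots), timestamp_ms, files)
        self.snapshots.append(snapshot)
        return snapshot

    def as_of(self, timestamp_ms: int) -> Optional[Snapshot]:
        """Historical read: latest snapshot committed at or before the timestamp."""
        candidates = [s for s in self.snapshots if s.timestamp_ms <= timestamp_ms]
        return candidates[-1] if candidates else None

    def incremental_files(self, from_id: int, to_id: int) -> FrozenSet[str]:
        """Incremental read: files present in to_id but not in from_id."""
        return self.snapshots[to_id].data_files - self.snapshots[from_id].data_files


table = Table()
s0 = table.commit(1000, added=["data/a.parquet"])
s1 = table.commit(2000, added=["data/b.parquet", "data/c.parquet"])

print(sorted(table.as_of(1500).data_files))
print(sorted(table.incremental_files(s0.snapshot_id, s1.snapshot_id)))
```

Reading as of timestamp 1500 resolves to s0 and sees only data/a.parquet, while the incremental read from s0 to s1 returns just the two files added by the second commit.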
Later articles in this series will explain in detail how Iceberg organizes table data at the storage level.