当前位置:网站首页>Data Lake (11): Iceberg table data organization and query
Data Lake (11): Iceberg table data organization and query
2022-06-30 13:06:00 【Lansonli】

List of articles
Iceberg Table data organization and query
One 、 download avro-tools jar package
Two 、 stay Hive Created in Iceberg Table and insert data
3、 ... and 、 see Iceberg Underlying data storage
1、 Query the latest snapshot data
2、 Query the data of a snapshot
3、 View the data of a snapshot based on the timestamp
Iceberg Table data organization and query
One 、 download avro-tools jar package
Because you need to check later avro The contents of the document , We can go through avro-tool.jar Check it out. avro The data content . It can be downloaded from the following website avro-tools Corresponding jar package , Download and upload to node5 Node :
“https://mvnrepository.com/artifact/org.apache.avro/avro-tools”.
see avro The following commands can be directly executed for file information , Can be avro Convert the data in to the corresponding json data .
[[email protected] ~]# java -jar /software/avro-tools-1.8.1.jar tojson snap-*-wqer.avroTwo 、 stay Hive Created in Iceberg Table and insert data
stay Hive Created in Iceberg Format table , And insert the following data :
# stay Hive Created in iceberg Format table
create table test_iceberg_tbl1(
id int ,
name string,
age int)
partitioned by (dt string)
stored by 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
# Insert the following data
insert into test_iceberg_tbl1 values (1,"zs",21,"20211212");
insert into test_iceberg_tbl1 values (2,"ls",22,"20211212");
insert into test_iceberg_tbl1 values (3,"ww",23,"20211213");
insert into test_iceberg_tbl1 values (4,"ml",24,"20211213");
insert into test_iceberg_tbl1 values (5,"tq",25,"20211213");3、 ... and 、 see Iceberg Underlying data storage
The following figure for Iceberg surface “test_iceberg_tbl1” stay HDFS Data organization chart stored in :

From the above figure, we can see that there are 5 individual Snapshot snapshot , above 5 individual Snapshot In fact, it corresponds to 5 individual Manifest list List of checklists .
1、 Query the latest snapshot data
In order to understand Iceberg How to query the latest data , You can refer to the following figure to understand the underlying implementation in detail .

Inquire about Iceberg Table data , First get the latest metadata Information , Get it here first “00000-*ec504.metadata.json” Metadata information , Parsing the current metadata file can get a snapshot of the current table id:“949358624197301886” And all the snapshot information of this table , That is to say json In information snapshots The value of the array . According to the snapshot of the current table id Value to get the corresponding snapshot Corresponding avro file information :“snap-*-32800.avro”, We can find the path corresponding to the current snapshot , See what it contains Manifest The manifest file has 5 individual :"*32800-m0.avro"、"*2abba-m0.avro"、"*d33de-m0.avro"、"*748bf-m0.avro"、"*b946e-m0.avro", Read the Iceberg The latest data of the format table is to read the corresponding data described in these files parquet Data file is enough .
We can see “snap-*-32800.avro” The snapshot file contains not only manifest Path information , also “added_data_files_count”、“existing_data_files_count”、“deleted_data_files_count” Three attributes ,Iceberg according to deleted_data_files_count Greater than 0 To determine the corresponding manifest Is there any deleted data in the manifest file , If one manifest The value in the manifest file is greater than 0 Represents data deletion , You don't need to read this when reading data manifest The data file corresponding to the manifest file .
according to Manifest list The corresponding manifest Inventory file , Each document describes the corresponding parquet Location information of file storage , You can see in the corresponding avro In file “status” attribute , The attribute is 1 For the corresponding parquet The file is a new file , Read required , by 2 representative parquet File deleted .
2、 Query the data of a snapshot
Apache Iceberg Supports querying snapshots at any time in history , When querying, you need to specify snapshot-id Attribute is enough , This can only be done through Spark/Flink To query and implement , For example, in Spark Query a snapshot data in as follows :
spark.read.option("snapshot-id",6155408340798912701L).format("iceberg").load("path")The principle of querying a snapshot data is shown in the following figure ( To query the snapshot id by “6155408340798912701” As an example ):

As can be seen from the figure above , In fact, the difference between reading historical snapshot data and reading the latest data is found snapshot-id It's just different , The principle is the same .
3、 View the data of a snapshot based on the timestamp
Apache iceberg It also supports the adoption of as-of-timestamp Parameter execution timestamp to read the data of a snapshot , Also through Spark/Flink To read ,Spark Read the code as follows :
spark.read.option("as-of-timestamp"," Time stamp ").format("iceberg").load("path")In fact, the principle and method of finding the corresponding data file through timestamp snapshot-id The principle of finding data files is the same , stay *.metadata.json In file , Except for “current-snapshot-id”、“snapshots” In addition to attributes, there are “snapshot-log” attribute , The corresponding values of this attribute are as follows :

We can see one of them timestamp-ms Properties and snapshot-id attribute , And according to timestamp-ms ascend . stay Iceberg Internal implementation , It will be as-of-timestamp Designated time and snapshot-log Of each element in the array timestamp-ms Compare , Find the last satisfaction timestamp-ms <= as-of-timestamp Corresponding snapshot-id, The principle of same , adopt snapshot-id Then find the data file to read .
- Blog home page :https://lansonli.blog.csdn.net
- Welcome to thumb up Collection Leaving a message. Please correct any mistakes !
- This paper is written by Lansonli original , First appeared in CSDN Blog
- When you stop to rest, don't forget that others are still running , I hope you will seize the time to learn , Go all out for a better life
边栏推荐
- Database usage in QT
- [one day learning awk] array usage
- 第十三章 信号(三)- 示例演示
- Exploring the source code of Boca Cross Chain Communication: Elements
- Motor control Clarke( α/β) Derivation of equal amplitude transformation
- rxjs Observable 两大类操作符简介
- Tronapi-波场接口-PHP版本--附接口文档-基于ThinkPHP5封装-源码无加密-可二开-作者详细指导-2022年6月28日11:49:56
- Kubeedge's core philosophy
- Shell编程概述
- 产品经理专业知识50篇(七)-如何建立一套完整的用户成长体系?
猜你喜欢

rxjs Observable 两大类操作符简介

深度长文探讨Join运算的简化和提速

App wechat payment unicloud version of uniapp payment (with source code)

数据湖(十一):Iceberg表数据组织与查询

【驚了】迅雷下載速度竟然比不上虛擬機中的下載速度

WTM重大更新,多租户和单点登录
![[Select] resource realization information, news, we media, blog applet (can be drained, open traffic master, with PC background management)](/img/e7/1c34d8aa364b944688ec2ffb4feb7c.jpg)
[Select] resource realization information, news, we media, blog applet (can be drained, open traffic master, with PC background management)

黑马笔记---List系列集合与泛型

Hangzhou E-Commerce Research Institute: the official website (website) is the only form of private domain

Derivation of Park transformation formula for motor control
随机推荐
[one day learning awk] use of built-in variables
JMeter learning notes
Visual studio configures QT and implements project packaging through NSIS
项目中遇到一个有趣的事情
力扣之螺旋矩阵,一起旋转起来(都能看懂)
Clearing TinyMCE rich text cache in elementui
机器学习笔记 - 自相关和偏自相关简介
[one day learning awk] array usage
IDEA 2021.3 执行 golang 报错:RNING: undefined behavior version of Delve is too old for Go version 1.18
微信小程序报错:TypeError: Cannot read property ‘setData‘ of undefined
Basic syntax of unity script (5) - vector
Golang foundation -- slicing several declaration methods
postman 自动生成 curl 代码片段
FFMpeg AVBufferPool 的理解与掌握
Mqtt ROS simulates publishing a custom message type
Rk356x u-boot Institute (command section) 3.2 usage of help command
【C】深入理解指针、回调函数(介绍模拟qsort)
电机控制park变换公式推导
基于ThinkPHP5封装tronapi-波场接口-PHP版本--附接口文档-20220627
论文解读(AGC)《Attributed Graph Clustering via Adaptive Graph Convolution》