Data Lake (11): Iceberg table data organization and query
2022-06-30 13:06:00 【Lansonli】
Table of contents
Iceberg table data organization and query
I. Download the avro-tools JAR
II. Create an Iceberg table in Hive and insert data
III. View Iceberg's underlying data storage
1. Query the latest snapshot data
2. Query the data of a specific snapshot
3. Query the data of a snapshot by timestamp
Iceberg table data organization and query
I. Download the avro-tools JAR
Because we need to inspect the contents of Avro files later, we can use the avro-tools JAR to view them. Download the JAR from the following address and upload it to the node5 node:
https://mvnrepository.com/artifact/org.apache.avro/avro-tools
To view an Avro file, run the following command, which converts the Avro data into the corresponding JSON:
java -jar /software/avro-tools-1.8.1.jar tojson snap-*-wqer.avro
II. Create an Iceberg table in Hive and insert data
Create an Iceberg-format table in Hive and insert the following data:
-- Create an Iceberg-format table in Hive
create table test_iceberg_tbl1(
id int ,
name string,
age int)
partitioned by (dt string)
stored by 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
-- Insert the following data
insert into test_iceberg_tbl1 values (1,"zs",21,"20211212");
insert into test_iceberg_tbl1 values (2,"ls",22,"20211212");
insert into test_iceberg_tbl1 values (3,"ww",23,"20211213");
insert into test_iceberg_tbl1 values (4,"ml",24,"20211213");
insert into test_iceberg_tbl1 values (5,"tq",25,"20211213");
III. View Iceberg's underlying data storage
The following figure shows how the data of the Iceberg table "test_iceberg_tbl1" is organized on HDFS:
The figure shows 5 snapshots, and these 5 snapshots correspond to 5 manifest list files.
1. Query the latest snapshot data
To understand how Iceberg queries the latest data, refer to the following figure, which shows the underlying implementation in detail.
To query an Iceberg table, Iceberg first obtains the latest metadata file, here “00000-*ec504.metadata.json”. Parsing this metadata file yields the current snapshot id of the table, “949358624197301886”, together with all snapshot information for the table, that is, the value of the snapshots array in the JSON. The current snapshot id is then used to locate the Avro file of the corresponding snapshot, “snap-*-32800.avro”. Opening the file at that path shows that it references 5 manifest files: "*32800-m0.avro", "*2abba-m0.avro", "*d33de-m0.avro", "*748bf-m0.avro", and "*b946e-m0.avro". To read the latest data of the Iceberg table, it is enough to read the Parquet data files described in these manifest files.
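The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Iceberg's actual implementation: the metadata content is a heavily trimmed, made-up example (the ids and paths are hypothetical), and the helper name manifest_list_for is invented here. The field names "current-snapshot-id", "snapshots", "snapshot-id", and "manifest-list" are the ones used in metadata.json.

```python
import json

# Hypothetical, heavily trimmed metadata.json content; ids and paths are made up.
METADATA_JSON = """
{
  "current-snapshot-id": 3,
  "snapshots": [
    {"snapshot-id": 1, "timestamp-ms": 1000, "manifest-list": "metadata/snap-1.avro"},
    {"snapshot-id": 3, "timestamp-ms": 3000, "manifest-list": "metadata/snap-3.avro"}
  ]
}
"""


def manifest_list_for(metadata, snapshot_id=None):
    """Return the manifest-list path of snapshot_id, or of the current snapshot if None."""
    wanted = metadata["current-snapshot-id"] if snapshot_id is None else snapshot_id
    for snapshot in metadata["snapshots"]:
        if snapshot["snapshot-id"] == wanted:
            return snapshot["manifest-list"]
    raise KeyError(f"snapshot {wanted} not found in metadata")


metadata = json.loads(METADATA_JSON)
print(manifest_list_for(metadata))     # manifest list of the current snapshot
print(manifest_list_for(metadata, 1))  # manifest list of an explicitly chosen snapshot
```

Passing an explicit snapshot id follows the same path as the default case, which is why reading a historical snapshot needs no separate mechanism.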
The snapshot file “snap-*-32800.avro” contains not only the manifest path information but also three attributes: “added_data_files_count”, “existing_data_files_count”, and “deleted_data_files_count”. Iceberg checks whether deleted_data_files_count is greater than 0 to determine whether the corresponding manifest file records deleted data. If the value for a manifest file is greater than 0, its data has been deleted, and the data files belonging to that manifest file do not need to be read.
Each manifest file referenced by the manifest list describes where the corresponding Parquet files are stored. Each entry in the manifest Avro file has a “status” attribute: a value of 1 means the Parquet file was newly added and must be read, and a value of 2 means the Parquet file has been deleted.
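The status check on manifest entries can be illustrated with a short Python sketch. It assumes the entries have already been converted to JSON-like dictionaries (for example with the avro-tools tojson command) and simplified to the two fields used here; the sample paths and the function name files_to_read are hypothetical. The Iceberg table spec also defines status 0 for a file that already existed in an earlier snapshot, so the sketch treats 0 and 1 as readable.

```python
# Status codes of a manifest entry, as defined by the Iceberg table spec.
EXISTING, ADDED, DELETED = 0, 1, 2


def files_to_read(manifest_entries):
    """Return the paths of the data files that are not marked as deleted."""
    return [
        entry["data_file"]["file_path"]
        for entry in manifest_entries
        if entry["status"] != DELETED
    ]


# Hypothetical entries, shaped like a simplified JSON dump of one manifest file.
entries = [
    {"status": ADDED, "data_file": {"file_path": "data/dt=20211212/a.parquet"}},
    {"status": DELETED, "data_file": {"file_path": "data/dt=20211212/b.parquet"}},
    {"status": EXISTING, "data_file": {"file_path": "data/dt=20211213/c.parquet"}},
]
print(files_to_read(entries))
```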
2. Query the data of a specific snapshot
Apache Iceberg supports querying any historical snapshot. You only need to specify the snapshot-id option when querying. This can only be done through Spark or Flink; for example, a snapshot can be queried in Spark as follows:
spark.read.option("snapshot-id",6155408340798912701L).format("iceberg").load("path")
The following figure shows the principle of querying a snapshot (taking snapshot id “6155408340798912701” as an example):
As the figure shows, reading a historical snapshot differs from reading the latest data only in which snapshot-id is looked up; the principle is the same.
3. Query the data of a snapshot by timestamp
Apache Iceberg also supports reading the data of a snapshot by specifying a timestamp with the as-of-timestamp parameter. This is likewise done through Spark or Flink; the Spark code is as follows:
spark.read.option("as-of-timestamp","<timestamp in milliseconds>").format("iceberg").load("path")
Finding the data files by timestamp works on the same principle as finding them by snapshot-id. Besides the “current-snapshot-id” and “snapshots” attributes, the *.metadata.json file also has a “snapshot-log” attribute, whose value is as follows:
Each element has a timestamp-ms attribute and a snapshot-id attribute, and the elements are sorted by timestamp-ms in ascending order. Internally, Iceberg compares the time specified by as-of-timestamp with the timestamp-ms of each element in the snapshot-log array and finds the snapshot-id of the last element that satisfies timestamp-ms <= as-of-timestamp. From there the principle is the same: the snapshot-id is used to find the data files to read.
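The timestamp comparison can be sketched in Python as follows. This is a minimal illustration of the rule stated above, not Iceberg's own code: the snapshot-log values are made up, and the function name snapshot_id_as_of is hypothetical. It relies on the log being sorted by timestamp-ms in ascending order, which lets a binary search find the last entry at or before the requested time.

```python
from bisect import bisect_right


def snapshot_id_as_of(snapshot_log, as_of_timestamp_ms):
    """Return the snapshot-id of the last log entry with timestamp-ms <= as_of_timestamp_ms."""
    timestamps = [entry["timestamp-ms"] for entry in snapshot_log]
    # bisect_right gives the number of entries whose timestamp is <= the requested time.
    count = bisect_right(timestamps, as_of_timestamp_ms)
    if count == 0:
        raise ValueError("no snapshot exists at or before the requested timestamp")
    return snapshot_log[count - 1]["snapshot-id"]


# Hypothetical snapshot-log content, sorted by timestamp-ms in ascending order.
snapshot_log = [
    {"timestamp-ms": 100, "snapshot-id": 11},
    {"timestamp-ms": 200, "snapshot-id": 22},
    {"timestamp-ms": 300, "snapshot-id": 33},
]
print(snapshot_id_as_of(snapshot_log, 250))
```

Once the snapshot-id is known, the remaining steps are the same as for a query by snapshot-id.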
- Blog home page: https://lansonli.blog.csdn.net
- Likes, bookmarks, and comments are welcome. Please point out any mistakes!
- This article is an original work by Lansonli, first published on the CSDN blog
- When you stop to rest, don't forget that others are still running. I hope you make the most of your time to learn and go all out for a better life