当前位置:网站首页>Data Lake (11): Iceberg table data organization and query
Data Lake (11): Iceberg table data organization and query
2022-07-02 14:07:00 【Lanson】
Iceberg Table data organization and query
One 、 download avro-tools jar package
Because you need to check later avro The contents of the document , We can go through avro-tool.jar Check it out. avro The data content . It can be downloaded from the following website avro-tools Corresponding jar package , Download and upload to node5 Node :
https://mvnrepository.com/artifact/org.apache.avro/avro-tools
see avro The following commands can be directly executed for file information , Can be avro Convert the data in to the corresponding json data .
[[email protected] ~]# java -jar /software/avro-tools-1.8.1.jar tojson snap-*-wqer.avro
Two 、 stay Hive Created in Iceberg Table and insert data
stay Hive Created in Iceberg Format table , And insert the following data :
# stay Hive Created in iceberg Format table
create table test_iceberg_tbl1(
id int ,
name string,
age int)
partitioned by (dt string)
stored by 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
# Insert the following data
insert into test_iceberg_tbl1 values (1,"zs",21,"20211212");
insert into test_iceberg_tbl1 values (2,"ls",22,"20211212");
insert into test_iceberg_tbl1 values (3,"ww",23,"20211213");
insert into test_iceberg_tbl1 values (4,"ml",24,"20211213");
insert into test_iceberg_tbl1 values (5,"tq",25,"20211213");
3、 ... and 、 see Iceberg Underlying data storage
The following figure for Iceberg surface “test_iceberg_tbl1” stay HDFS Data organization chart stored in :
From the above figure, we can see that there are 5 individual Snapshot snapshot , above 5 individual Snapshot In fact, it corresponds to 5 individual Manifest list List of checklists .
1、 Query the latest snapshot data
In order to understand Iceberg How to query the latest data , You can refer to the following figure to understand the underlying implementation in detail .
Inquire about Iceberg Table data , First get the latest metadata Information , Get it here first “00000-*ec504.metadata.json” Metadata information , Parsing the current metadata file can get a snapshot of the current table id:“949358624197301886” And all the snapshot information of this table , That is to say json In information snapshots The value of the array . According to the snapshot of the current table id Value to get the corresponding snapshot Corresponding avro file information :“snap-*-32800.avro”, We can find the path corresponding to the current snapshot , See what it contains Manifest The manifest file has 5 individual :"*32800-m0.avro"、"*2abba-m0.avro"、"*d33de-m0.avro"、"*748bf-m0.avro"、"*b946e-m0.avro", Read the Iceberg The latest data of the format table is to read the corresponding data described in these files parquet Data file is enough .
We can see “snap-*-32800.avro” The snapshot file contains not only manifest Path information , also “added_data_files_count”、“existing_data_files_count”、“deleted_data_files_count” Three attributes ,Iceberg according to deleted_data_files_count Greater than 0 To determine the corresponding manifest Is there any deleted data in the manifest file , If one manifest The value in the manifest file is greater than 0 Represents data deletion , You don't need to read this when reading data manifest The data file corresponding to the manifest file .
according to Manifest list The corresponding manifest Inventory file , Each document describes the corresponding parquet Location information of file storage , You can see in the corresponding avro In file “status” attribute , The attribute is 1 For the corresponding parquet The file is a new file , Read required , by 2 representative parquet File deleted .
2、 Query the data of a snapshot
Apache Iceberg Supports querying snapshots at any time in history , When querying, you need to specify snapshot-id Attribute is enough , This can only be done through Spark/Flink To query and implement , For example, in Spark Query a snapshot data in as follows :
spark.read.option("snapshot-id",6155408340798912701L).format("iceberg").load("path")
The principle of querying a snapshot data is shown in the following figure ( To query the snapshot id by “6155408340798912701” As an example ):
As can be seen from the figure above , In fact, the difference between reading historical snapshot data and reading the latest data is found snapshot-id It's just different , The principle is the same .
3、 View the data of a snapshot based on the timestamp
Apache iceberg It also supports the adoption of as-of-timestamp Parameter execution timestamp to read the data of a snapshot , Also through Spark/Flink To read ,Spark Read the code as follows :
spark.read.option("as-of-timestamp"," Time stamp ").format("iceberg").load("path")
In fact, the principle and method of finding the corresponding data file through timestamp snapshot-id The principle of finding data files is the same , stay *.metadata.json In file , Except for “current-snapshot-id”、“snapshots” In addition to attributes, there are “snapshot-log” attribute , The corresponding values of this attribute are as follows :
We can see one of them timestamp-ms Properties and snapshot-id attribute , And according to timestamp-ms ascend . stay Iceberg Internal implementation , It will be as-of-timestamp Designated time and snapshot-log Of each element in the array timestamp-ms Compare , Find the last satisfaction timestamp-ms <= as-of-timestamp Corresponding snapshot-id, The principle of same , adopt snapshot-id Then find the data file to read .
边栏推荐
- (POJ - 1308)Is It A Tree? (tree)
- (POJ - 1984) navigation nightare (weighted and search set)
- Sum of the first n terms of Fibonacci (fast power of matrix)
- BeanUtils--浅拷贝--实例/原理
- Do you know that there is an upper limit on the size of Oracle data files?
- Daily practice of C language --- monkeys divide peaches
- Code implementation MNLM
- ArrayList and LinkedList
- Selenium installing selenium in pycharm
- Whole house Wi Fi: a pain point that no one can solve?
猜你喜欢
rxjs Observable 自定义 Operator 的开发技巧
[development environment] 010 editor tool (tool download | binary file analysis template template installation | shortcut key viewing and setting)
MySQL45讲——学习极客时间MySQL实战45讲笔记—— 04 | 深入浅出索引(上)
Error: eacces: permission denied, access to "/usr/lib/node_modules"
When tidb meets Flink: tidb efficiently enters the lake "new play" | tilaker team interview
Find love for speed in F1 delta time Grand Prix
Skillfully use SSH to get through the Internet restrictions
Selenium installing selenium in pycharm
The global special paper revenue in 2021 was about $27 million, and it is expected to reach $35 million in 2028. From 2022 to 2028, the CAGR was 3.8%
In 2021, the global styrene butadiene styrene (SBS) revenue was about $3722.7 million, and it is expected to reach $5679.6 million in 2028
随机推荐
瀏覽器驅動的下載
The global special paper revenue in 2021 was about $27 million, and it is expected to reach $35 million in 2028. From 2022 to 2028, the CAGR was 3.8%
每天坚持20分钟go的基础二
浏览器驱动的下载
rxjs Observable 自定义 Operator 的开发技巧
路由(二)
Skillfully use SSH to get through the Internet restrictions
Subcontracting configuration of uniapp applet subpackages
How to explain binary search to my sister? This is really difficult, fan!
Qt原代码基本知识
Codeforces Round #803 (Div. 2)(A~D)
Multi rotor aircraft control using PID and LQR controllers
你的 Sleep 服务会梦到服务网格外的 bookinfo 吗
Code implementation MNLM
Getting started with QT - making a simple calculator
无主灯设计:如何让智能照明更加「智能」?
自定义事件,全局事件总线,消息订阅与发布,$nextTick
Solve "sub number integer", "jump happily", "turn on the light"
Memory management 01 - link script
联合搜索:搜索中的所有需求