Data Lake (11): Iceberg table data organization and query
2022-07-02 15:56:00 【Hua Weiyun】
Iceberg table data organization and query
I. Download the avro-tools jar package
Because we will need to inspect the contents of Avro files later, we can use avro-tools.jar to view the data in an Avro file. Download the avro-tools jar package from the following address and upload it to the node5 node:
https://mvnrepository.com/artifact/org.apache.avro/avro-tools
To view an Avro file, run the following command, which converts the data in the Avro file to the corresponding JSON:
java -jar /software/avro-tools-1.8.1.jar tojson snap-*-wqer.avro
II. Create an Iceberg table in Hive and insert data
Create an Iceberg-format table in Hive and insert the following data:
-- Create an Iceberg-format table in Hive
create table test_iceberg_tbl1(id int, name string, age int) partitioned by (dt string) stored by 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
-- Insert the following data
insert into test_iceberg_tbl1 values (1,"zs",21,"20211212");
insert into test_iceberg_tbl1 values (2,"ls",22,"20211212");
insert into test_iceberg_tbl1 values (3,"ww",23,"20211213");
insert into test_iceberg_tbl1 values (4,"ml",24,"20211213");
insert into test_iceberg_tbl1 values (5,"tq",25,"20211213");
III. View Iceberg's underlying data storage
The following figure shows how the data of the Iceberg table “test_iceberg_tbl1” is organized in HDFS:

The figure above shows 5 snapshots. Each of these 5 snapshots corresponds to one manifest list.
1. Query the latest snapshot data
To understand how Iceberg queries the latest data, refer to the following figure, which shows the underlying implementation in detail.

When an Iceberg table is queried, the latest metadata is read first; here that is the metadata file “00000-*ec504.metadata.json”. Parsing this metadata file yields the table's current snapshot id, “949358624197301886”, together with all snapshot information for the table, that is, the value of the snapshots array in the JSON. The current snapshot id is then used to locate the Avro file of that snapshot: “snap-*-32800.avro”. Following the path of the current snapshot, we can see that it references 5 manifest files: "*32800-m0.avro", "*2abba-m0.avro", "*d33de-m0.avro", "*748bf-m0.avro", and "*b946e-m0.avro". Reading the latest data of the Iceberg table then amounts to reading the Parquet data files described in these manifest files.
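The lookup chain described above (metadata file, then manifest list, then manifest files, then Parquet files) can be sketched in Python. This is only an illustrative model, not Iceberg's implementation: the FILES dictionary stands in for files on HDFS that would really be JSON and Avro, and every path and id in it is hypothetical. The key names current-snapshot-id, snapshots, manifest-list, manifest_path, and file_path follow the Iceberg metadata layout.

```python
# In-memory stand-in for the table's metadata directory (all paths and ids hypothetical).
FILES = {
    "metadata/example.metadata.json": {
        "current-snapshot-id": 2,
        "snapshots": [
            {"snapshot-id": 1, "manifest-list": "metadata/snap-1.avro"},
            {"snapshot-id": 2, "manifest-list": "metadata/snap-2.avro"},
        ],
    },
    # Manifest list of snapshot 2: one row per manifest file.
    "metadata/snap-2.avro": [
        {"manifest_path": "metadata/b-m0.avro"},
        {"manifest_path": "metadata/a-m0.avro"},
    ],
    # Manifest files: one entry per data file.
    "metadata/a-m0.avro": [{"data_file": {"file_path": "data/dt=20211212/a.parquet"}}],
    "metadata/b-m0.avro": [{"data_file": {"file_path": "data/dt=20211212/b.parquet"}}],
}


def current_data_files(metadata_path):
    """Walk metadata -> manifest list -> manifests and collect data file paths."""
    metadata = FILES[metadata_path]
    current_id = metadata["current-snapshot-id"]
    # Pick the snapshot whose id matches the table's current snapshot id.
    snapshot = next(s for s in metadata["snapshots"] if s["snapshot-id"] == current_id)
    data_files = []
    for manifest in FILES[snapshot["manifest-list"]]:
        for entry in FILES[manifest["manifest_path"]]:
            data_files.append(entry["data_file"]["file_path"])
    return data_files


print(current_data_files("metadata/example.metadata.json"))
```

The sketch ignores deleted entries and partition pruning; it only shows the three levels of indirection between the metadata file and the data files.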
The snapshot file “snap-*-32800.avro” contains not only the manifest paths but also three attributes: “added_data_files_count”, “existing_data_files_count”, and “deleted_data_files_count”. Iceberg checks whether deleted_data_files_count is greater than 0 to determine whether a manifest file records deleted data. A value greater than 0 means data was deleted, and the deleted data files recorded in that manifest do not need to be read.
Each manifest file referenced by the manifest list describes where the corresponding Parquet files are stored. Each entry in these Avro files has a “status” attribute: 1 means the Parquet file was newly added and must be read; 2 means the Parquet file was deleted.
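As a rough illustration of how these counts and the status attribute can be used together, the Python sketch below skips manifests that contain no added or existing files and then drops entries whose status is 2. It is a simplified model under stated assumptions, not Iceberg's reader: the manifest names, file paths, and the dictionaries standing in for Avro files are all hypothetical. Status 0 (an existing file carried over from an earlier snapshot) is not mentioned in the text above but is defined alongside 1 and 2 in the Iceberg table spec.

```python
# Manifest entry status codes as defined in the Iceberg table spec.
EXISTING, ADDED, DELETED = 0, 1, 2

# Hypothetical manifest-list rows, trimmed to the fields discussed above.
MANIFEST_LIST = [
    {"manifest_path": "m-added.avro", "added_data_files_count": 1,
     "existing_data_files_count": 0, "deleted_data_files_count": 0},
    {"manifest_path": "m-mixed.avro", "added_data_files_count": 0,
     "existing_data_files_count": 1, "deleted_data_files_count": 1},
    {"manifest_path": "m-deleted.avro", "added_data_files_count": 0,
     "existing_data_files_count": 0, "deleted_data_files_count": 1},
]

# Hypothetical manifest contents, keyed by manifest path.
MANIFEST_ENTRIES = {
    "m-added.avro": [{"status": ADDED, "data_file": {"file_path": "data/new.parquet"}}],
    "m-mixed.avro": [
        {"status": EXISTING, "data_file": {"file_path": "data/kept.parquet"}},
        {"status": DELETED, "data_file": {"file_path": "data/dropped.parquet"}},
    ],
    "m-deleted.avro": [{"status": DELETED, "data_file": {"file_path": "data/gone.parquet"}}],
}


def has_live_files(manifest):
    """A manifest with no added and no existing files holds only deleted entries."""
    return (manifest["added_data_files_count"] > 0
            or manifest["existing_data_files_count"] > 0)


def files_to_read(manifest_list, manifest_entries):
    """Collect the data files a reader still needs, skipping deleted ones."""
    paths = []
    for manifest in manifest_list:
        if not has_live_files(manifest):
            continue  # nothing in this manifest needs to be read
        for entry in manifest_entries[manifest["manifest_path"]]:
            if entry["status"] != DELETED:
                paths.append(entry["data_file"]["file_path"])
    return paths


print(files_to_read(MANIFEST_LIST, MANIFEST_ENTRIES))
```

The manifest-level check avoids opening manifests that cannot contribute data, while the entry-level check handles manifests that mix live and deleted files.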
2. Query the data of a specific snapshot
Apache Iceberg supports querying any historical snapshot; the query only needs to specify the snapshot-id option. This can only be done through Spark or Flink. For example, a snapshot can be queried in Spark as follows:
spark.read.option("snapshot-id",6155408340798912701L).format("iceberg").load("path")
The following figure shows how a snapshot is queried (taking the snapshot id “6155408340798912701” as an example):

As the figure above shows, reading a historical snapshot differs from reading the latest data only in which snapshot-id is looked up; the principle is the same.
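A small Python sketch makes the point concrete: one lookup function serves both cases, and only the id it searches for changes. The two snapshot ids are the ones quoted in this article; placing them in the same metadata document and the manifest-list paths are assumptions made purely for illustration.

```python
# Trimmed, hypothetical metadata content; real metadata.json files carry many more fields.
METADATA = {
    "current-snapshot-id": 949358624197301886,
    "snapshots": [
        {"snapshot-id": 6155408340798912701,
         "manifest-list": "metadata/snap-older-example.avro"},
        {"snapshot-id": 949358624197301886,
         "manifest-list": "metadata/snap-current-example.avro"},
    ],
}


def manifest_list_for(metadata, snapshot_id=None):
    """Return the manifest list of the requested snapshot, or of the current one."""
    wanted = metadata["current-snapshot-id"] if snapshot_id is None else snapshot_id
    for snapshot in metadata["snapshots"]:
        if snapshot["snapshot-id"] == wanted:
            return snapshot["manifest-list"]
    raise LookupError(f"no snapshot with id {wanted}")


# Latest data: no id given, so the current snapshot is used.
print(manifest_list_for(METADATA))
# Historical data: the same lookup with an explicit snapshot id.
print(manifest_list_for(METADATA, 6155408340798912701))
```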
3. Query the data of a snapshot by timestamp
Apache Iceberg also supports reading the data of a snapshot by specifying a timestamp through the as-of-timestamp option. This is likewise done through Spark or Flink; the Spark code is as follows:
spark.read.option("as-of-timestamp","<timestamp in milliseconds>").format("iceberg").load("path")
Finding the data files by timestamp works the same way as finding them by snapshot-id. In the *.metadata.json file, besides the “current-snapshot-id” and “snapshots” attributes there is also a “snapshot-log” attribute, whose values look like this:

Each element has a timestamp-ms attribute and a snapshot-id attribute, and the elements are sorted by timestamp-ms in ascending order. Internally, Iceberg compares the time given by as-of-timestamp with the timestamp-ms of each element in the snapshot-log array and finds the snapshot-id of the last element that satisfies timestamp-ms <= as-of-timestamp. From there the principle is the same: the snapshot-id is used to find the data files to read.
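The rule "last entry with timestamp-ms <= as-of-timestamp" can be sketched in Python as follows. The snapshot-log entries are made-up sample values, and the binary search via bisect is just one convenient way to express the comparison; it is not taken from Iceberg's source code.

```python
from bisect import bisect_right

# Hypothetical snapshot-log entries, ascending by timestamp-ms as described above.
SNAPSHOT_LOG = [
    {"timestamp-ms": 1000, "snapshot-id": 11},
    {"timestamp-ms": 2000, "snapshot-id": 22},
    {"timestamp-ms": 3000, "snapshot-id": 33},
]


def snapshot_id_as_of(snapshot_log, as_of_timestamp_ms):
    """Return the id of the last snapshot with timestamp-ms <= as_of_timestamp_ms."""
    timestamps = [entry["timestamp-ms"] for entry in snapshot_log]
    # bisect_right gives the number of entries whose timestamp is <= the target.
    count = bisect_right(timestamps, as_of_timestamp_ms)
    if count == 0:
        raise ValueError("no snapshot exists at or before the given timestamp")
    return snapshot_log[count - 1]["snapshot-id"]


print(snapshot_id_as_of(SNAPSHOT_LOG, 2500))  # falls between the 2nd and 3rd entries
print(snapshot_id_as_of(SNAPSHOT_LOG, 2000))  # exact match on the 2nd entry
```

Both calls resolve to the second entry, because a timestamp equal to an entry's timestamp-ms still satisfies the <= condition. A timestamp earlier than the first entry has no matching snapshot, which the sketch reports as an error.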