Data Lake (IX): Iceberg features and data types
2022-07-07 14:31:00 【51CTO】
Iceberg features in detail and data types
I. Iceberg features in detail
1. Iceberg partitioning and hidden partitioning (Hidden Partition)
Iceberg supports partitioning to speed up queries. Once a partition is defined on an Iceberg table, similar rows are grouped together when data is written, which lets queries skip irrelevant data when reading. Iceberg can organize partitions by timestamp at year, month, day, or hour granularity.
Hive also supports partitioning, but to benefit from it a query has to specify the partition conditions explicitly in SQL. In Iceberg, the query does not need a dedicated partition filter: Iceberg works out the partitions automatically and filters out the data that is not needed.
Partition information in Iceberg can be hidden. A partition value can be computed from a regular field, so after a table is created or its partition strategy is changed, newly written data is automatically assigned to the partition it belongs to. Queries likewise do not need to know which fields the table is partitioned by; you only focus on the business logic, and Iceberg automatically skips the partitions that are not needed.
Because partition information in Iceberg is independent of the directory in which the table data is stored, the partitioning of an Iceberg table can be changed without any data migration.
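As a rough illustration of how a hidden partition value can be computed from a single timestamp field, the sketch below mimics the year, month, day, and hour transforms in plain Python. This is not Iceberg's API: the function names and the sample rows are hypothetical, and the sketch assumes (following the Iceberg table spec) that each transform counts whole units since 1970-01-01 UTC.

```python
from collections import defaultdict
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def year_transform(ts: datetime) -> int:
    """Whole years since 1970."""
    return ts.year - 1970


def month_transform(ts: datetime) -> int:
    """Whole months since 1970-01."""
    return (ts.year - 1970) * 12 + (ts.month - 1)


def day_transform(ts: datetime) -> int:
    """Whole days since 1970-01-01."""
    return (ts - EPOCH).days


def hour_transform(ts: datetime) -> int:
    """Whole hours since 1970-01-01 00:00 UTC."""
    delta = ts - EPOCH
    return delta.days * 24 + delta.seconds // 3600


def group_by_partition(rows, transform):
    """Group row keys by the partition value derived from their timestamp."""
    groups = defaultdict(list)
    for key, ts in rows:
        groups[transform(ts)].append(key)
    return dict(groups)


# Hypothetical rows: (order id, event time in UTC).
rows = [
    ("o1", datetime(2022, 7, 7, 14, 31, tzinfo=timezone.utc)),
    ("o2", datetime(2022, 7, 7, 23, 59, tzinfo=timezone.utc)),
    ("o3", datetime(2022, 7, 8, 0, 5, tzinfo=timezone.utc)),
]

by_day = group_by_partition(rows, day_transform)
print(by_day)
```

The writer never supplies a partition column here; the partition value is derived from the timestamp itself, which is the idea behind hidden partitioning. A reader can apply the same transform to the bounds of a timestamp filter to decide which partitions to scan.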
2. Iceberg table evolution (Table Evolution)
In a Hive partitioned table, changing a table partitioned by day into one partitioned by hour cannot be done on the original table. You have to create a new table partitioned by hour and then load the data into it.
Iceberg supports in-place table evolution. Table-level changes, such as schema evolution or a change of partition layout, can be made through SQL. These operations are cheap, because they involve no time-consuming reading, rewriting, or migration of existing data.
3. Schema evolution (Schema Evolution)
Iceberg supports the following schema evolution operations:
- Add: Add a new column to a table or to a nested struct.
- Drop: Remove a column from a table or from a nested struct.
- Rename: Rename a column in a table or a field in a nested struct.
- Update: Widen the primitive type of a column or of a field inside a complex type (struct, map<K,V>, list), for example int to long.
- Reorder: Change the order of columns, or the order of fields in a nested struct.
Note:
Iceberg schema changes are metadata-only operations and do not rewrite data files. For map types, adding or dropping fields is not supported on map keys when it would change equality.
Iceberg guarantees that schema evolution changes are independent and free of side effects, without rewriting data files:
- Adding a column never reads existing values from another column.
- Dropping a column or a field in a nested struct does not change the values of any other column.
- Updating a column or a field in a nested struct does not change the values of any other column.
- Changing the order of columns or of fields in a nested struct does not change the values associated with them.
To provide these guarantees, Iceberg tracks every column in a table with a unique ID. When a column is added, it is assigned a new ID, so existing data is never attributed to the wrong column.
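The role of column IDs can be sketched with a small in-memory model. This is only an illustration under stated assumptions, not Iceberg's implementation: the `Schema` class and its methods are hypothetical, and a stored row is modeled as a dictionary keyed by field ID rather than by column name.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Schema:
    """Hypothetical schema: maps a stable field ID to the current column name."""
    fields: Dict[int, str] = field(default_factory=dict)
    last_id: int = 0

    def _id_of(self, name: str) -> int:
        for field_id, current in self.fields.items():
            if current == name:
                return field_id
        raise KeyError(name)

    def add(self, name: str) -> int:
        # A new column always gets a fresh ID, even if the name was used before.
        self.last_id += 1
        self.fields[self.last_id] = name
        return self.last_id

    def drop(self, name: str) -> None:
        del self.fields[self._id_of(name)]

    def rename(self, old: str, new: str) -> None:
        # Only metadata changes; stored rows are untouched.
        self.fields[self._id_of(old)] = new

    def read(self, stored_row: Dict[int, Any]) -> Dict[str, Any]:
        # Resolve values by ID; columns missing from old data read as None.
        return {name: stored_row.get(fid) for fid, name in self.fields.items()}


schema = Schema()
schema.add("id")      # field ID 1
schema.add("name")    # field ID 2

# A row written under the original schema, keyed by field ID.
old_row = {1: 7, 2: "alice"}

schema.rename("name", "full_name")
after_rename = schema.read(old_row)

schema.drop("full_name")
schema.add("name")    # field ID 3, unrelated to the dropped column
after_readd = schema.read(old_row)

print(after_rename)
print(after_readd)
```

After the rename, the old row is still readable under the new name because the lookup goes through ID 2. After the drop and re-add, the new `name` column has ID 3, so the old value stored under ID 2 does not reappear.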
4. Partition evolution (Partition Evolution)
Iceberg can update the partitioning of an existing table, because queries do not reference partition information directly.
When the partition strategy of a table is changed, the data written before the change is left as it is and keeps the old partition strategy, while new data is written with the new one. The same table therefore carries two partition strategies, which are kept separate in the metadata and do not overlap.
As a result, when a SQL query spans both partition strategies, it is planned as two separate execution plans, as illustrated by a figure on the Iceberg official site:
In the figure, the booking_table table is partitioned by month for 2008 and by day from 2009 onward, and the two partition strategies coexist in the table. Thanks to Iceberg's hidden partitioning (Hidden Partition), a SQL query does not need to specify partition filter conditions (by month or by day); Iceberg works out the partitions automatically and filters out the data that is not needed.
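The booking_table case can be sketched as follows. This is a simplified model rather than Iceberg's planner: the `DataFile` class, the file paths, and the `plan` function are hypothetical, and the sketch assumes each data file remembers the ID of the partition spec it was written with (spec 0 by month, spec 1 by day).

```python
from dataclasses import dataclass
from datetime import date


def month_of(d: date) -> int:
    """Months since 1970-01."""
    return (d.year - 1970) * 12 + (d.month - 1)


def day_of(d: date) -> int:
    """Days since 1970-01-01."""
    return (d - date(1970, 1, 1)).days


# Spec 0 is the old strategy (by month), spec 1 the new one (by day).
SPECS = {0: month_of, 1: day_of}


@dataclass(frozen=True)
class DataFile:
    path: str        # hypothetical file path
    spec_id: int     # partition spec the file was written with
    partition: int   # partition value under that spec


def plan(files, lo: date, hi: date):
    """Keep files whose partition may contain rows with lo <= date <= hi.

    The date filter is translated separately for each spec, so files
    written under the old and the new strategy are pruned independently.
    """
    selected = []
    for f in files:
        transform = SPECS[f.spec_id]
        if transform(lo) <= f.partition <= transform(hi):
            selected.append(f)
    return selected


files = [
    DataFile("month=2008-11/a.parquet", 0, month_of(date(2008, 11, 1))),
    DataFile("month=2008-12/b.parquet", 0, month_of(date(2008, 12, 1))),
    DataFile("day=2009-01-01/c.parquet", 1, day_of(date(2009, 1, 1))),
    DataFile("day=2009-01-02/d.parquet", 1, day_of(date(2009, 1, 2))),
    DataFile("day=2009-01-15/e.parquet", 1, day_of(date(2009, 1, 15))),
]

# The query filters on the date only and never mentions month or day partitions.
selected = plan(files, date(2008, 12, 20), date(2009, 1, 2))
print([f.path for f in selected])
```

The December 2008 file is kept whole because monthly partitions are coarser than the filter, so the rows inside it would still need to be filtered by the query's date predicate after the scan.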
5. Sort order evolution (Sort Order Evolution)
Iceberg can change the sort order of an existing table. After the sort order is changed, old data keeps the old sort order. A compute engine writing data to Iceberg always chooses the latest sort order, but it may skip sorting when sorting would be prohibitively expensive.
II. Iceberg data types
Iceberg tables support the following data types:
Type | Description | Notes |
boolean | Boolean, true or false | |
int | 32-bit signed integer | Can be promoted to long |
long | 64-bit signed integer | |
float | Single-precision floating point | Can be promoted to double |
double | Double-precision floating point | |
decimal(P,S) | Fixed-point decimal | P is the precision (total number of digits) and S is the scale (number of digits after the decimal point). P must be less than or equal to 38. |
date | Calendar date, without time or time zone | |
time | Time of day, without date or time zone | Stored in microseconds; 1000 microseconds = 1 millisecond |
timestamp | Timestamp without time zone | Stored in microseconds; 1000 microseconds = 1 millisecond |
timestamptz | Timestamp with time zone | Stored in microseconds; 1000 microseconds = 1 millisecond |
string | String of arbitrary length | UTF-8 encoded |
fixed(L) | Fixed-length byte array of length L | |
binary | Byte array of arbitrary length | |
struct<...> | A structured field composed of fields of any data type | |
list<E> | A list with elements of any data type | |
map<K,V> | A map with keys of type K and values of type V | |
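A few of the notes in the table can be checked with a short Python sketch. It is not part of any Iceberg library: the helper names are hypothetical, the promotion set only covers the two conversions listed in the table, and the conversion to microseconds assumes that timestamps are counted from 1970-01-01 UTC, as described in the Iceberg table spec.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Only the promotions listed in the table above.
PROMOTIONS = {("int", "long"), ("float", "double")}


def can_promote(src: str, dst: str) -> bool:
    """True if a column of type src may be changed to type dst."""
    return src == dst or (src, dst) in PROMOTIONS


def check_decimal(precision: int, scale: int) -> None:
    """Reject decimal(P,S) declarations whose precision exceeds 38.

    Requiring a positive precision is an assumption of this sketch;
    the scale is accepted as given.
    """
    if not 0 < precision <= 38:
        raise ValueError(f"decimal({precision},{scale}): precision must be in 1..38")


def to_micros(ts: datetime) -> int:
    """Microseconds since the epoch, using integer arithmetic only."""
    delta = ts - EPOCH
    return (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds


print(can_promote("int", "long"), can_promote("long", "int"))
print(to_micros(datetime(1970, 1, 1, 0, 0, 1, 500, tzinfo=timezone.utc)))
```

Integer arithmetic is used in `to_micros` because converting through a floating-point number of seconds can lose microsecond precision for timestamps far from the epoch.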