
Data Lake (IX): Iceberg features and data types

2022-07-07 14:31:00 【51CTO】

Iceberg Features in Detail and Data Types

I. Iceberg Features in Detail

1. Iceberg Partitioning and Hidden Partitioning (Hidden Partition)

Iceberg supports partitioning to speed up queries. Once partitioning is defined on an Iceberg table, rows with similar values are grouped together at write time, which makes reads faster. Iceberg can organize partitions from a timestamp at year, month, day, or hour granularity.

Hive also supports partitioning, but to benefit from it a query must explicitly specify the partition conditions in its SQL to filter the data. In Iceberg, queries do not need to specify partition filters; Iceberg derives the partitions automatically and filters out data that is not needed.

Partition information in Iceberg can be hidden: a partition field can be computed from another field. After a table is created or its partition policy is changed, new data is automatically assigned to the partition it belongs to. Queries do not need to know which fields the table is partitioned by and can focus on the business logic; Iceberg automatically avoids reading partitions that are not needed.

Because Iceberg's partition information is independent of the directory layout in which the table data is stored, the partitioning of an Iceberg table can be modified without migrating any data.
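The idea of a partition value computed from a timestamp column can be illustrated with a small, self-contained Python sketch. This is not Iceberg's API: `ToyTable`, `month_transform`, and `day_transform` are hypothetical names, and the sketch assumes all timestamps are in UTC. It only shows how a filter on the source column can be turned into a partition range, so the query never mentions the partition itself.

```python
from collections import defaultdict
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def month_transform(ts):
    """Months elapsed since 1970-01 (assumes ts is in UTC)."""
    return (ts.year - 1970) * 12 + (ts.month - 1)


def day_transform(ts):
    """Whole days elapsed since 1970-01-01 (assumes ts is in UTC)."""
    return (ts - EPOCH).days


class ToyTable:
    """Hypothetical in-memory table that derives the partition from event_time."""

    def __init__(self, transform):
        self.transform = transform
        self.partitions = defaultdict(list)  # partition value -> rows

    def append(self, row):
        # The writer never supplies a partition column; it is computed here.
        self.partitions[self.transform(row["event_time"])].append(row)

    def scan(self, start, end):
        """Return rows with start <= event_time < end and the number of partitions read."""
        lo, hi = self.transform(start), self.transform(end)
        rows, partitions_read = [], 0
        for value, part_rows in self.partitions.items():
            if value < lo or value > hi:
                continue  # partition pruned: it cannot contain matching rows
            partitions_read += 1
            rows.extend(r for r in part_rows if start <= r["event_time"] < end)
        return rows, partitions_read


table = ToyTable(day_transform)
for day in (5, 6, 7):
    table.append({"event_time": datetime(2022, 7, day, 10, tzinfo=timezone.utc), "value": day})

# The filter is expressed only on event_time; the partition is never named.
rows, partitions_read = table.scan(
    datetime(2022, 7, 6, tzinfo=timezone.utc),
    datetime(2022, 7, 6, 12, tzinfo=timezone.utc),
)
print(rows, partitions_read)
```

Swapping `day_transform` for `month_transform` changes the physical grouping without changing how the scan is written, which is the point of keeping the partition hidden.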

2. Iceberg Table Evolution (Table Evolution)

In a Hive partitioned table, changing a table partitioned by day into one partitioned by hour cannot be done on the original table. You have to create a new table partitioned by hour and then load the data into it.

Iceberg supports in-place table evolution: table-level changes, such as changing the partition layout, can be made through SQL. These operations are cheap because they avoid the time-consuming work of reading, rewriting, or migrating data.

3. Schema Evolution (Schema Evolution)

Iceberg supports the following schema evolution operations:

Add: Add a new column to a table or to a nested structure.
Drop: Remove a column from a table or from a nested structure.
Rename: Rename a column in a table or in a nested structure.
Update: Widen the type of a column, or of a primitive-type field inside a complex structure (struct, map<K,V>, list), for example from int to long.
Reorder: Change the order of columns, or of the fields inside a nested structure.

Note:

An Iceberg schema change is a metadata-only operation; it does not involve rewriting data files. Map keys do not support adding or dropping struct fields that would change equality.

Iceberg guarantees that schema evolution changes are independent and free of side effects, and that they do not involve rewriting data files. Specifically:

An added column never reads existing values from another column.
Dropping a column or a field in a nested structure does not change the values of any other column.
Updating a column or a field in a nested structure does not change the values of any other column.
Changing the order of columns or of fields in a nested structure does not change the values associated with them.

Iceberg achieves this by tracking every column in the table with a unique ID. When a column is added, it is assigned a new ID, so data that belongs to another column is never read by mistake.
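The role of column IDs can be shown with a minimal Python sketch. It is only an illustration under assumptions, not Iceberg code: `ToySchema` and its methods are hypothetical, and a "data file" is modeled as a plain dict keyed by column ID. Because stored values are looked up by ID rather than by name, a rename is a metadata-only change, and a column re-added under an old name does not pick up the old column's data.

```python
class ToySchema:
    """Hypothetical schema that maps column names to never-reused IDs."""

    def __init__(self):
        self._next_id = 1
        self.columns = {}  # column name -> column ID, in column order

    def add_column(self, name):
        if name in self.columns:
            raise ValueError(f"column {name!r} already exists")
        self.columns[name] = self._next_id
        self._next_id += 1  # IDs are never reused, even after a drop
        return self.columns[name]

    def drop_column(self, name):
        del self.columns[name]

    def rename_column(self, old, new):
        if new in self.columns:
            raise ValueError(f"column {new!r} already exists")
        # Only the name changes; the ID (and therefore the stored data) stays.
        self.columns = {(new if n == old else n): i for n, i in self.columns.items()}

    def read(self, stored_record):
        """Project a stored record (keyed by column ID) onto the current schema."""
        return {name: stored_record.get(col_id) for name, col_id in self.columns.items()}


schema = ToySchema()
schema.add_column("id")    # gets ID 1
schema.add_column("name")  # gets ID 2

# A record written under the original schema, stored by column ID.
stored = {1: 100, 2: "alice"}

schema.rename_column("name", "full_name")
after_rename = schema.read(stored)

schema.drop_column("full_name")
schema.add_column("name")  # gets a fresh ID 3, not the old ID 2
after_readd = schema.read(stored)

print(after_rename)
print(after_readd)
```

In this sketch the re-added `name` column reads as `None` for the old record, since nothing was ever written under its new ID.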

4. Partition Evolution (Partition Evolution)

Iceberg can update the partitioning of an existing table, because queries do not reference partition values directly.

When the partition policy of a table is changed, data written before the change is left as it is and keeps the old partition strategy, while new data uses the new strategy. The same table therefore holds two partition strategies, which are kept independent of each other in the metadata and do not overlap.

As a result, when a SQL query spans both partition strategies, it is resolved into two separate execution plans, as the figure provided on the official Iceberg website shows:

[Figure: partition evolution of booking_table, from the official Iceberg website]

In the figure, the booking_table table is partitioned by month for 2008 and by day from 2009 onward; both partition strategies coexist in the table. Thanks to Iceberg's hidden partitioning (Hidden Partition), a SQL query does not need to specify partition filters (by month or by day); Iceberg works out the partitions automatically and filters out data that is not needed.
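The way one query is planned separately against each partition strategy can be sketched in a few lines of Python. The file names, spec IDs, and the `plan` function below are hypothetical and heavily simplified; the sketch only assumes that each data file remembers which partition strategy it was written with, as in the booking_table example (monthly for 2008, daily from 2009).

```python
from datetime import date


def month_of(d):
    return (d.year, d.month)


def day_of(d):
    return (d.year, d.month, d.day)


# Hypothetical partition strategies: 0 = by month (old), 1 = by day (new).
SPECS = {0: month_of, 1: day_of}

# Hypothetical data files; each records the strategy it was written with.
FILES = [
    {"path": "f1", "spec_id": 0, "partition": (2008, 11)},
    {"path": "f2", "spec_id": 0, "partition": (2008, 12)},
    {"path": "f3", "spec_id": 1, "partition": (2009, 1, 1)},
    {"path": "f4", "spec_id": 1, "partition": (2009, 1, 2)},
]


def plan(files, start, end):
    """Select files that may hold rows with start <= date <= end."""
    selected = []
    for f in files:
        # The same date filter is translated once per partition strategy.
        transform = SPECS[f["spec_id"]]
        lo, hi = transform(start), transform(end)
        if lo <= f["partition"] <= hi:
            selected.append(f["path"])
    return selected


# One filter on the date column covers both the monthly and the daily layout.
selected = plan(FILES, date(2008, 12, 15), date(2009, 1, 1))
print(selected)
```

The caller only supplies a date range; whether a given file was laid out by month or by day is resolved from the file's own metadata.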

5. Sort Order Evolution (Sort Order Evolution)

Iceberg can modify the sort order of an existing table. After the sort order is changed, old data keeps the old sort order. A compute engine writing data to Iceberg always chooses the latest sort order, but may skip sorting when it would be prohibitively expensive.

II. Iceberg Data Types

Iceberg tables support the following data types:

Type	Description	Notes
boolean	Boolean, true or false
int	32-bit signed integer	Can be promoted to long
long	64-bit signed integer
float	Single-precision floating point	Can be promoted to double
double	Double-precision floating point
decimal(P,S)	Fixed-point decimal	P is the precision (total number of digits) and S is the scale (number of decimal places). P must be less than or equal to 38.
date	Calendar date, without time or time zone
time	Time of day, without date or time zone	Stored in microseconds (1000 microseconds = 1 millisecond)
timestamp	Timestamp without time zone	Stored in microseconds (1000 microseconds = 1 millisecond)
timestamptz	Timestamp with time zone	Stored in microseconds (1000 microseconds = 1 millisecond)
string	Character string of arbitrary length	UTF-8 encoded
fixed(L)	Fixed-length byte array of length L
binary	Byte array of arbitrary length
struct<...>	A structured field made up of fields of any data type
list<E>	A list whose elements are of any data type
map<K,V>	A map whose keys (K) and values (V) are of any data type
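Some of the notes in the table can be expressed as small checks. The following Python sketch is illustrative only: `can_promote`, `is_valid_decimal_precision`, `to_micros`, and `from_micros` are hypothetical helpers, not part of any Iceberg library. It covers just the promotions listed in the table (int to long, float to double), the precision limit of 38, and microsecond storage of timestamps relative to the Unix epoch in UTC, which is an assumption of this sketch.

```python
from datetime import datetime, timedelta, timezone

# Promotions listed in the table above: int -> long and float -> double.
ALLOWED_PROMOTIONS = {("int", "long"), ("float", "double")}

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def can_promote(src, dst):
    """True if a column of type src may be widened to type dst."""
    return src == dst or (src, dst) in ALLOWED_PROMOTIONS


def is_valid_decimal_precision(precision):
    """The table requires P <= 38; the lower bound of 1 is an assumption here."""
    return 1 <= precision <= 38


def to_micros(ts):
    """Microseconds between the Unix epoch and a timezone-aware datetime."""
    return (ts - EPOCH) // timedelta(microseconds=1)


def from_micros(micros):
    """Inverse of to_micros: rebuild the UTC datetime."""
    return EPOCH + timedelta(microseconds=micros)


one_millisecond = to_micros(EPOCH + timedelta(milliseconds=1))
print(can_promote("int", "long"), can_promote("long", "int"))
print(is_valid_decimal_precision(38), is_valid_decimal_precision(39))
print(one_millisecond)
```

Narrowing (for example long to int) is rejected by the check because it could lose data, whereas widening keeps every existing value representable.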