Data Lake (13): Spark and Iceberg Integration DDL Operations
2022-07-04 14:28:00 【Lansonli】
Table of contents
Spark and Iceberg Integration: DDL Operations
1. CREATE TABLE: creating tables
2. CREATE TABLE ... AS SELECT
2.1 Create the table hadoop_prod.default.mytbl and insert data
2.2 Create the table mytbl2 with "create table ... as select" and query it
3. REPLACE TABLE ... AS SELECT
3.1 Create the table "hadoop_prod.default.mytbl3", insert data, and display it
3.2 Rebuild the table "hadoop_prod.default.mytbl2" from mytbl3 and insert the corresponding data
4. DROP TABLE
5. ALTER TABLE
6. ALTER TABLE partition operations
6.1 Create the table mytbl and insert data
6.2 Add the loc column as a partition column, insert data, and query
6.3 Add the ts column as a partition column via a partition transform, insert data, and query
6.4 Drop the loc partition field
6.5 Drop the years(ts) partition field
Spark and Iceberg Integration: DDL Operations
Here a Hadoop catalog is used to demonstrate the DDL operations of Spark integrated with Iceberg.
1. CREATE TABLE: creating tables
CREATE TABLE creates an Iceberg table. It can create not only ordinary tables but also partitioned tables. When inserting a batch of data into a partitioned table, the rows must be sorted by the partition column, otherwise an "Already closed files for partition" error is thrown. The code is as follows:
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder().master("local").appName("SparkOperateIceberg")
  // Register a Hadoop catalog; the catalog name is hadoop_prod
  .config("spark.sql.catalog.hadoop_prod", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.hadoop_prod.type", "hadoop")
  .config("spark.sql.catalog.hadoop_prod.warehouse", "hdfs://mycluster/sparkoperateiceberg")
  .getOrCreate()

// Create an ordinary (non-partitioned) table
spark.sql(
  """
    | create table if not exists hadoop_prod.default.normal_tbl(id int,name string,age int) using iceberg
  """.stripMargin)

// Create a partitioned table with the loc column as the partition field
spark.sql(
  """
    |create table if not exists hadoop_prod.default.partition_tbl(id int,name string,age int,loc string) using iceberg partitioned by (loc)
  """.stripMargin)

// When inserting data into a partitioned table, the rows must be sorted by the partition column,
// otherwise the error java.lang.IllegalStateException: Already closed files for partition: xxx is thrown
spark.sql(
  """
    |insert into table hadoop_prod.default.partition_tbl values (1,"zs",18,"beijing"),(3,"ww",20,"beijing"),(2,"ls",19,"shanghai"),(4,"ml",21,"shanghai")
  """.stripMargin)

spark.sql("select * from hadoop_prod.default.partition_tbl").show()
The query returns the four inserted rows.
When creating Iceberg partitioned tables, you can also apply transform expressions to a timestamp column to create hidden partitions. The common transform expressions are as follows:
- years(ts): partition by year
// Create the partitioned table partition_tbl1, partitioned by year
spark.sql(
"""
|create table if not exists hadoop_prod.default.partition_tbl1(id int ,name string,age int,regist_ts timestamp) using iceberg
|partitioned by (years(regist_ts))
""".stripMargin)
// Note: the inserted rows must be grouped by partition value in advance; rows that fall in the same partition must be written together
//(1,'zs',18,1608469830) --"2020-12-20 21:10:30"
//(2,'ls',19,1634559630) --"2021-10-18 20:20:30"
//(3,'ww',20,1603096230) --"2020-10-19 16:30:30"
//(4,'ml',21,1639920630) --"2021-12-19 21:30:30"
//(5,'tq',22,1608279630) --"2020-12-18 16:20:30"
//(6,'gb',23,1576843830) --"2019-12-20 20:10:30"
spark.sql(
"""
|insert into hadoop_prod.default.partition_tbl1 values
|(1,'zs',18,cast(1608469830 as timestamp)),
|(3,'ww',20,cast(1603096230 as timestamp)),
|(5,'tq',22,cast(1608279630 as timestamp)),
|(2,'ls',19,cast(1634559630 as timestamp)),
|(4,'ml',21,cast(1639920630 as timestamp)),
|(6,'gb',23,cast(1576843830 as timestamp))
""".stripMargin)
// Query results
spark.sql(
"""
|select * from hadoop_prod.default.partition_tbl1
""".stripMargin).show()
The query returns the six inserted rows; on HDFS the data is partitioned by year.
- months(ts): month-level partitioning by "year-month"
// Create the partitioned table partition_tbl2, partitioned by months, i.e. by "year-month"
spark.sql(
"""
|create table if not exists hadoop_prod.default.partition_tbl2(id int ,name string,age int,regist_ts timestamp) using iceberg
|partitioned by (months(regist_ts))
""".stripMargin)
// Note: the inserted rows must be grouped by partition value in advance; rows that fall in the same partition must be written together
//(1,'zs',18,1608469830) --"2020-12-20 21:10:30"
//(2,'ls',19,1634559630) --"2021-10-18 20:20:30"
//(3,'ww',20,1603096230) --"2020-10-19 16:30:30"
//(4,'ml',21,1639920630) --"2021-12-19 21:30:30"
//(5,'tq',22,1608279630) --"2020-12-18 16:20:30"
//(6,'gb',23,1576843830) --"2019-12-20 20:10:30"
spark.sql(
"""
|insert into hadoop_prod.default.partition_tbl2 values
|(1,'zs',18,cast(1608469830 as timestamp)),
|(5,'tq',22,cast(1608279630 as timestamp)),
|(2,'ls',19,cast(1634559630 as timestamp)),
|(3,'ww',20,cast(1603096230 as timestamp)),
|(4,'ml',21,cast(1639920630 as timestamp)),
|(6,'gb',23,cast(1576843830 as timestamp))
""".stripMargin)
// Query results
spark.sql(
"""
|select * from hadoop_prod.default.partition_tbl2
""".stripMargin).show()
The query returns the six inserted rows; on HDFS the data is partitioned by "year-month".
- days(ts) or date(ts): day-level partitioning by "year-month-day"
// Create the partitioned table partition_tbl3, partitioned by days, i.e. by "year-month-day"
spark.sql(
"""
|create table if not exists hadoop_prod.default.partition_tbl3(id int ,name string,age int,regist_ts timestamp) using iceberg
|partitioned by (days(regist_ts))
""".stripMargin)
// Note: the inserted rows must be grouped by partition value in advance; rows that fall in the same partition must be written together
//(1,'zs',18,1608469830) --"2020-12-20 21:10:30"
//(2,'ls',19,1634559630) --"2021-10-18 20:20:30"
//(3,'ww',20,1603096230) --"2020-10-19 16:30:30"
//(4,'ml',21,1639920630) --"2021-12-19 21:30:30"
//(5,'tq',22,1608279630) --"2020-12-18 16:20:30"
//(6,'gb',23,1576843830) --"2019-12-20 20:10:30"
spark.sql(
"""
|insert into hadoop_prod.default.partition_tbl3 values
|(1,'zs',18,cast(1608469830 as timestamp)),
|(5,'tq',22,cast(1608279630 as timestamp)),
|(2,'ls',19,cast(1634559630 as timestamp)),
|(3,'ww',20,cast(1603096230 as timestamp)),
|(4,'ml',21,cast(1639920630 as timestamp)),
|(6,'gb',23,cast(1576843830 as timestamp))
""".stripMargin)
// Query results
spark.sql(
"""
|select * from hadoop_prod.default.partition_tbl3
""".stripMargin).show()
The query returns the six inserted rows; on HDFS the data is partitioned by "year-month-day".
- hours(ts) or date_hour(ts): hour-level partitioning by "year-month-day-hour"
// Create the partitioned table partition_tbl4, partitioned by hours, i.e. by "year-month-day-hour"
spark.sql(
"""
|create table if not exists hadoop_prod.default.partition_tbl4(id int ,name string,age int,regist_ts timestamp) using iceberg
|partitioned by (hours(regist_ts))
""".stripMargin)
// Note: the inserted rows must be grouped by partition value in advance; rows that fall in the same partition must be written together
//(1,'zs',18,1608469830) --"2020-12-20 21:10:30"
//(2,'ls',19,1634559630) --"2021-10-18 20:20:30"
//(3,'ww',20,1603096230) --"2020-10-19 16:30:30"
//(4,'ml',21,1639920630) --"2021-12-19 21:30:30"
//(5,'tq',22,1608279630) --"2020-12-18 16:20:30"
//(6,'gb',23,1576843830) --"2019-12-20 20:10:30"
spark.sql(
"""
|insert into hadoop_prod.default.partition_tbl4 values
|(1,'zs',18,cast(1608469830 as timestamp)),
|(5,'tq',22,cast(1608279630 as timestamp)),
|(2,'ls',19,cast(1634559630 as timestamp)),
|(3,'ww',20,cast(1603096230 as timestamp)),
|(4,'ml',21,cast(1639920630 as timestamp)),
|(6,'gb',23,cast(1576843830 as timestamp))
""".stripMargin)
// Query results
spark.sql(
"""
|select * from hadoop_prod.default.partition_tbl4
""".stripMargin).show()
The query returns the six inserted rows; on HDFS the data is partitioned by "year-month-day-hour".
Iceberg time-based partitioning currently supports only UTC. UTC is Coordinated Universal Time; UTC+8 (the East 8 time zone, i.e. Beijing time) is UTC plus eight hours, which is why the partition values above do not match the local times shown in the data.
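As a quick illustration (a sketch added here, not from the original example), the epoch second 1608469830 used above can be printed in both UTC and Asia/Shanghai with the standard java.time API; the time partition transforms derive their values from the UTC reading:

import java.time.{Instant, ZoneId, ZoneOffset}

val instant = Instant.ofEpochSecond(1608469830L)
// Partition values are derived from the UTC timestamp ...
println(instant.atZone(ZoneOffset.UTC))              // 2020-12-20T13:10:30Z
// ... while the comments in the data above show the Beijing (UTC+8) local time
println(instant.atZone(ZoneId.of("Asia/Shanghai")))  // 2020-12-20T21:10:30+08:00[Asia/Shanghai]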
Besides the common time-based hidden partitions above, Iceberg also supports bucket(N, col) partitioning, which assigns a row to a partition based on the hash of the column value modulo N, and truncate(L, col) partitioning, which truncates the column value to length L so that values sharing the same prefix land in the same partition. A minimal sketch of both is shown below.
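The following is a minimal sketch of both transforms, assuming the truncate(L, col) argument order described above; the table names bucket_tbl and truncate_tbl are made up here for illustration and do not appear elsewhere in this article:

// Hash-bucket partitioning: rows are assigned to one of 16 buckets by hash(id) % 16
spark.sql(
  """
    |create table if not exists hadoop_prod.default.bucket_tbl(id int,name string,age int) using iceberg
    |partitioned by (bucket(16, id))
  """.stripMargin)

// Truncate partitioning: rows sharing the first 2 characters of name land in the same partition
spark.sql(
  """
    |create table if not exists hadoop_prod.default.truncate_tbl(id int,name string,age int) using iceberg
    |partitioned by (truncate(2, name))
  """.stripMargin)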
2. CREATE TABLE ... AS SELECT
Iceberg supports the "create table ... as select" syntax, which creates a table from a query statement and inserts the corresponding data. The operation is as follows:
2.1 Create the table hadoop_prod.default.mytbl and insert data
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder().master("local").appName("SparkOperateIceberg")
  // Register a Hadoop catalog; the catalog name is hadoop_prod
  .config("spark.sql.catalog.hadoop_prod", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.hadoop_prod.type", "hadoop")
  .config("spark.sql.catalog.hadoop_prod.warehouse", "hdfs://mycluster/sparkoperateiceberg")
  .getOrCreate()
// Create a normal table
spark.sql(
"""
| create table hadoop_prod.default.mytbl(id int,name string,age int) using iceberg
""".stripMargin)
// Insert data into table
spark.sql(
"""
|insert into table hadoop_prod.default.mytbl values (1,"zs",18),(3,"ww",20),(2,"ls",19),(4,"ml",21)
""".stripMargin)
// Query data
spark.sql("select * from hadoop_prod.default.mytbl").show()
2.2 Create the table mytbl2 with "create table ... as select" and query it
spark.sql(
"""
|create table hadoop_prod.default.mytbl2 using iceberg as select id,name,age from hadoop_prod.default.mytbl
""".stripMargin)
spark.sql(
"""
|select * from hadoop_prod.default.mytbl2
""".stripMargin).show()
The query returns the four rows copied from mytbl.
3. REPLACE TABLE ... AS SELECT
Iceberg supports the "replace table ... as select" syntax, which rebuilds a table from a query statement and inserts the corresponding data. The operation is as follows:
3.1 Create the table "hadoop_prod.default.mytbl3", insert data, and display it
spark.sql(
"""
|create table hadoop_prod.default.mytbl3 (id int,name string,loc string,score int) using iceberg
""".stripMargin)
spark.sql(
"""
|insert into table hadoop_prod.default.mytbl3 values (1,"zs","beijing",100),(2,"ls","shanghai",200)
""".stripMargin)
spark.sql(
"""
|select * from hadoop_prod.default.mytbl3
""".stripMargin).show
3.2 Rebuild the table "hadoop_prod.default.mytbl2" from mytbl3 and insert the corresponding data
spark.sql(
"""
|replace table hadoop_prod.default.mytbl2 using iceberg as select * from hadoop_prod.default.mytbl3
""".stripMargin)
spark.sql(
"""
|select * from hadoop_prod.default.mytbl2
""".stripMargin).show()
4. DROP TABLE
To delete an Iceberg table, simply execute a "drop table xxx" statement. When a table is dropped, its data is deleted, but the database directory remains.
// Delete table
spark.sql(
"""
|drop table hadoop_prod.default.mytbl
""".stripMargin)
5. ALTER TABLE
Iceberg's alter operations are supported in Spark 3.x and generally include the following:
- Add and drop columns
Add column: ALTER TABLE ... ADD COLUMN
Drop column: ALTER TABLE ... DROP COLUMN
//1. Create the table test, insert data, and query it
spark.sql(
"""
|create table hadoop_prod.default.test(id int,name string,age int) using iceberg
""".stripMargin)
spark.sql(
"""
|insert into table hadoop_prod.default.test values (1,"zs",18),(2,"ls",19),(3,"ww",20)
""".stripMargin)
spark.sql(
"""
| select * from hadoop_prod.default.test
""".stripMargin).show()
//2. Add fields: add the gender and loc columns to the test table
spark.sql(
"""
|alter table hadoop_prod.default.test add column gender string,loc string
""".stripMargin)
//3. Drop a field: drop the age column from the test table
spark.sql(
"""
|alter table hadoop_prod.default.test drop column age
""".stripMargin)
//4. View the test table data
spark.sql(
"""
|select * from hadoop_prod.default.test
""".stripMargin).show()
The final table no longer has the age column and now shows the gender and loc columns.
- Rename columns
Rename column syntax: ALTER TABLE ... RENAME COLUMN. The operation is as follows:
//5. Rename a column
spark.sql(
"""
|alter table hadoop_prod.default.test rename column gender to xxx
|
""".stripMargin)
spark.sql(
"""
|select * from hadoop_prod.default.test
""".stripMargin).show()
In the final table, the gender column has been renamed to xxx.
6. ALTER TABLE partition operations
The alter partition operations include adding and dropping partition fields. They are supported from Spark 3.x onward (Spark 2.4 does not support them), and they require setting the spark.sql.extensions property in the Spark configuration to org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions. Partition transforms are also supported when adding a partition field. The syntax is as follows:
- Add partition field: ALTER TABLE ... ADD PARTITION FIELD
- Drop partition field: ALTER TABLE ... DROP PARTITION FIELD
The specific operations are as follows:
6.1 Create the table mytbl and insert data
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder().master("local").appName("SparkOperateIceberg")
  // Register a Hadoop catalog; the catalog name is hadoop_prod
  .config("spark.sql.catalog.hadoop_prod", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.hadoop_prod.type", "hadoop")
  .config("spark.sql.catalog.hadoop_prod.warehouse", "hdfs://mycluster/sparkoperateiceberg")
  // Required for ALTER TABLE ... ADD/DROP PARTITION FIELD
  .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .getOrCreate()
//1. Create a normal table
spark.sql(
"""
| create table hadoop_prod.default.mytbl(id int,name string,loc string,ts timestamp) using iceberg
""".stripMargin)
//2. Insert data into the table and query it
spark.sql(
"""
|insert into hadoop_prod.default.mytbl values
|(1,'zs',"beijing",cast(1608469830 as timestamp)),
|(3,'ww',"shanghai",cast(1603096230 as timestamp))
""".stripMargin)
spark.sql("select * from hadoop_prod.default.mytbl").show()
The query returns the two inserted rows; at this point the data on HDFS is not partitioned.
6.2 Add the loc column as a partition column, insert data, and query
//3. Add the loc column as a partition field; this requires the config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") setting shown above
spark.sql(
"""
|alter table hadoop_prod.default.mytbl add partition field loc
""".stripMargin)
//4. Continue inserting data into mytbl; data written before this point is unpartitioned, data written afterwards is partitioned by loc
spark.sql(
"""
|insert into hadoop_prod.default.mytbl values
|(5,'tq',"hangzhou",cast(1608279630 as timestamp)),
|(2,'ls',"shandong",cast(1634559630 as timestamp))
""".stripMargin )
spark.sql("select * from hadoop_prod.default.mytbl").show()
The query now returns four rows; on HDFS the newly inserted data is written into loc partition directories, while the earlier data remains unpartitioned.
Note: adding a partition field is a metadata operation and does not change existing table data. New data is written with the new partition layout, while existing data keeps its original layout; the sketch below shows one way to inspect this.
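One way to observe this (a hedged sketch, assuming the Iceberg metadata tables are reachable through this Spark catalog) is to query the table's files metadata table, where each data file reports the partition values it was written with; files written before the ALTER show no loc value:

// Inspect per-file partition values via the Iceberg "files" metadata table
spark.sql(
  """
    |select file_path, partition from hadoop_prod.default.mytbl.files
  """.stripMargin).show(false)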
6.3 Add the ts column as a partition column via a partition transform, insert data, and query
//5. Add the ts column as a partition field through the years() partition transform
spark.sql(
"""
|alter table hadoop_prod.default.mytbl add partition field years(ts)
""".stripMargin)
//6. Continue inserting data into mytbl; data written afterwards is additionally partitioned by years(ts)
spark.sql(
"""
|insert into hadoop_prod.default.mytbl values
|(4,'ml',"beijing",cast(1639920630 as timestamp)),
|(6,'gb',"tianjin",cast(1576843830 as timestamp))
""".stripMargin )
spark.sql("select * from hadoop_prod.default.mytbl").show()
The query now returns six rows; on HDFS the newly inserted data is additionally partitioned by years(ts).
6.4 Drop the loc partition field
//7. Drop the loc partition field from the mytbl table
spark.sql(
"""
|alter table hadoop_prod.default.mytbl drop partition field loc
""".stripMargin)
//8. Continue inserting data into the mytbl table and query it
spark.sql(
"""
|insert into hadoop_prod.default.mytbl values
|(4,'ml',"beijing",cast(1639920630 as timestamp)),
|(6,'gb',"tianjin",cast(1576843830 as timestamp))
""".stripMargin )
spark.sql("select * from hadoop_prod.default.mytbl").show()
The query now returns eight rows; on HDFS the newly inserted data is no longer partitioned by loc.
Note: because the partition derived from the ts transform still exists, the loc partition value of the newly inserted data is null.
6.5 Drop the years(ts) partition field
//9. Drop the years(ts) partition field from the mytbl table
spark.sql(
"""
|alter table hadoop_prod.default.mytbl drop partition field years(ts)
""".stripMargin)
//10. Continue inserting data into the mytbl table and query it
spark.sql(
"""
|insert into hadoop_prod.default.mytbl values
|(5,'tq',"hangzhou",cast(1608279630 as timestamp)),
|(2,'ls',"shandong",cast(1634559630 as timestamp))
""".stripMargin )
spark.sql("select * from hadoop_prod.default.mytbl").show()
The query now returns ten rows; on HDFS the newly inserted data is no longer partitioned by years(ts) either.
- Blog home page: https://lansonli.blog.csdn.net
- This article was originally written by Lansonli and first published on the CSDN blog.
- When you stop to rest, don't forget that others are still running; I hope you make the most of your time and go all out for a better life.