Pyspark writes data to iceberg
2022-07-28 08:34:00 · Lu Xinhang
PySpark environment setup

1. Copy the following jars into D:\Python\python37\Lib\site-packages\pyspark\jars:
iceberg-spark3-runtime-0.13.1.jar
alluxio-2.6.2-client.jar

2. Under D:\Python\python37\Lib\site-packages\pyspark, create a conf folder and put hdfs-site.xml and hive-site.xml into it.
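The two placement steps above can be scripted. A minimal sketch that resolves the pyspark install directory at run time instead of hard-coding the Windows path from the post (the copy commands, shown commented out, assume the jars and XML files sit in the current directory):

```shell
# Resolve pyspark's install directory; fall back to a demo directory
# if pyspark is not installed.
PYSPARK_DIR=$(python3 -c "import pyspark, os; print(os.path.dirname(pyspark.__file__))" 2>/dev/null || echo /tmp/pyspark)
mkdir -p "$PYSPARK_DIR/jars" "$PYSPARK_DIR/conf"
# Stage the runtime jars and the Hadoop/Hive client configs (file names from the post):
# cp iceberg-spark3-runtime-0.13.1.jar alluxio-2.6.2-client.jar "$PYSPARK_DIR/jars/"
# cp hdfs-site.xml hive-site.xml "$PYSPARK_DIR/conf/"
echo "jars dir: $PYSPARK_DIR/jars"
```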
Code
import os
import warnings
import argparse  # imported in the original post but not used below
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import StructField, StructType, DecimalType, IntegerType, TimestampType, StringType
import pypinyin

warnings.filterwarnings("ignore")
def get_spark():
    os.environ.setdefault('HADOOP_USER_NAME', 'root')
    spark = SparkSession.builder \
        .config('spark.sql.debug.maxToStringFields', 2000) \
        .config('spark.debug.maxToStringFields', 2000) \
        .getOrCreate()
    spark.conf.set("spark.sql.catalog.iceberg", "org.apache.iceberg.spark.SparkCatalog")
    spark.conf.set("spark.sql.catalog.iceberg.type", "hive")
    spark.conf.set("spark.sql.catalog.iceberg.uri", "thrift://192.168.x.xx:9083")
    # Spark cannot natively handle Iceberg's timestamp-without-timezone type; this flag
    # makes it treat all such fields as timestamp with timezone. Without it:
    # "Cannot handle timestamp without timezone fields in Spark. Spark does not natively
    # support this type but if you would like to handle all timestamps as timestamp with
    # timezone set 'spark.sql.iceberg.handle-timestamp-without-timezone' to true"
    spark.conf.set("spark.sql.iceberg.handle-timestamp-without-timezone", True)
    # Dynamic mode so an insert overwrite only replaces the partitions it writes.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "DYNAMIC")
    # spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")
    # Setting LEGACY raises: pyspark.sql.utils.AnalysisException: LEGACY store assignment
    # policy is disallowed in Spark data source V2. Please set the configuration
    # spark.sql.storeAssignmentPolicy to other values.
    # See https://www.cnblogs.com/songchaolin/p/12098618.html
    return spark
def Capitalize_hanzipinyin(word):
    # Placeholder in the original post: meant to convert Chinese text to capitalized pinyin.
    return ''
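The post leaves Capitalize_hanzipinyin as an empty stub. One possible implementation using the pypinyin package the script already imports (the author's intended behavior is an assumption; this sketch capitalizes each syllable and falls back to the raw text if pypinyin is unavailable):

```python
def capitalize_hanzi_pinyin(word):
    """Convert Chinese text to capitalized pinyin syllables, e.g. '中国' -> 'ZhongGuo'."""
    if not word:
        return ''
    try:
        import pypinyin  # third-party: pip install pypinyin
    except ImportError:
        return word  # graceful fallback when pypinyin is missing
    return ''.join(s.capitalize() for s in pypinyin.lazy_pinyin(word))
```

Wrapped in `udf(..., StringType())`, it drops in where the original `toPinyinUDF` is registered.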
def main_run(dt):
    table_name = 'iceberg.xxx.xxx'
    target_table_name = 'iceberg.xxx.xxx'
    target_table_name_columns = ['A', 'B']
    sql = """ select A,B from %s where dt = '%s' """ % (table_name, dt)
    spark = get_spark()
    spark_df = spark.sql(sql)
    toPinyinUDF = udf(Capitalize_hanzipinyin, StringType())
    spark_df = spark_df.withColumn('A_pinyin', toPinyinUDF('A'))

    # Solution 1: delete the target partition, then append
    delete_sql = "delete from %s where dt = '%s' " % (target_table_name, dt)
    spark.sql(delete_sql)
    spark_df.write.saveAsTable(target_table_name, None, "append", partitionBy='dt')
    # Solution 2: register a temporary view and insert overwrite the partition
    spark_df.createOrReplaceTempView("test")
    spark.sql(
        "insert overwrite table %s partition(dt) select A,B,A_pinyin from test" % target_table_name)
    # Using select * instead of an explicit column list fails with:
    # Cannot safely cast '': string to int
    # Solution 3: insertInto with overwrite=True
    new_spark_df = spark.sql("SELECT A,B,A_pinyin from test")
    new_spark_df.write.insertInto(target_table_name, True)
    # Solution 4: overwrites ALL data in the table, not just one partition
    # new_spark_df.write.saveAsTable(target_table_name, None, "overwrite", partitionBy='dt')
    # Solution 5: convert the Spark DataFrame to a pandas DataFrame; dtype matching may fail
    df = spark_df.toPandas()
    # After toPandas(), an integer column from Spark can arrive in pandas as float.
    df['A_pinyin'] = df['A'].apply(Capitalize_hanzipinyin)
    df = df[target_table_name_columns]  # reorder columns
    schema = StructType([
        StructField("A", StringType(), True),
        # ... remaining StructFields elided in the original post
    ])
    # Without an explicit schema: ValueError: Some of types cannot be determined after
    # inferring -- Spark cannot infer the type of some fields.
    # Even with the schema, a float column may still fail:
    # field A: IntegerType can not accept object 2.0 in type <class 'float'>
    DF = spark.createDataFrame(df, schema)
    DF.write.insertInto(target_table_name, True)
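The float-conversion pitfall behind solution 5 is easy to reproduce in pandas alone; a small sketch (the column name `B` is illustrative):

```python
import pandas as pd

# An integer column that contains NULLs arrives in pandas as float64,
# so 2 becomes 2.0 and IntegerType later rejects it in createDataFrame.
df = pd.DataFrame({'B': [1, 2, None]})
assert str(df['B'].dtype) == 'float64'  # NaN forces the integer column to float

# One workaround before round-tripping back to Spark: pandas' nullable Int64 dtype.
df['B'] = df['B'].astype('Int64')
assert str(df['B'].dtype) == 'Int64'
```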