Spark 3.0 testing and use
2022-07-27 15:37:00 【wankunde】
Compatibility with Hadoop and Hive
Spark 3.0 officially supports a minimum of Hadoop 2.7 and Hive 1.2 by default. Our platform runs CDH 5.13, whose corresponding versions are hadoop-2.6.0 and hive-1.1.0, so we compile Spark 3.0 ourselves.
Build environment: Maven 3.6.3, Java 8, Scala 2.12
Precompiling Hive
Because Hive 1.1.0 is quite old, many of its dependencies are incompatible with Spark 3.0, so it needs to be recompiled.
The bundled commons-lang3 package is too old and lacks support for Java 9 and above.
Compile and install the hive-exec module (-pl ql builds only the ql module, -am also builds the modules it depends on): mvn clean package install -DskipTests -pl ql -am -Phadoop-2
Code changes :
diff --git a/pom.xml b/pom.xml
index 5d14dc4..889b960 100644
--- a/pom.xml
+++ b/pom.xml
@@ -248,6 +248,14 @@
<enabled>false</enabled>
</snapshots>
</repository>
+ <repository>
+ <id>spring</id>
+ <name>Spring repo</name>
+ <url>https://repo.spring.io/plugins-release/</url>
+ <releases>
+ <enabled>true</enabled>
+ </releases>
+ </repository>
</repositories>
<!-- Hadoop dependency management is done at the bottom under profiles -->
@@ -982,6 +990,9 @@
<profiles>
<profile>
<id>thriftif</id>
+ <properties>
+ <thrift.home>/usr/local/opt/[email protected]</thrift.home>
+ </properties>
<build>
<plugins>
<plugin>
diff --git a/ql/pom.xml b/ql/pom.xml
index 0c5e91f..101ef11 100644
--- a/ql/pom.xml
+++ b/ql/pom.xml
@@ -736,10 +736,7 @@
<include>org.apache.hive:hive-exec</include>
<include>org.apache.hive:hive-serde</include>
<include>com.esotericsoftware.kryo:kryo</include>
- <include>com.twitter:parquet-hadoop-bundle</include>
- <include>org.apache.thrift:libthrift</include>
<include>commons-lang:commons-lang</include>
- <include>org.apache.commons:commons-lang3</include>
<include>org.jodd:jodd-core</include>
<include>org.json:json</include>
<include>org.apache.avro:avro</include>
Compiling Spark
# Apache
git clone git@github.com:apache/spark.git
git checkout v3.0.0
# Leyan version, which mainly contains the Spark/Hive compatibility changes
git clone [email protected]:HDP/spark.git
git checkout -b v3.0.0_cloudera origin/v3.0.0_cloudera
./dev/make-distribution.sh --name cloudera --tgz -DskipTests -Phive -Phive-thriftserver -Pyarn -Pcdhhive
# Update the local Maven repository
mvn clean install -DskipTests=true -Phive -Phive-thriftserver -Pyarn -Pcdhhive
# deploy
rm -rf /opt/spark-3.0.0-bin-cloudera
tar -zxvf spark-3.0.0-bin-cloudera.tgz
rm -rf /opt/spark-3.0.0-bin-cloudera/conf
ln -s /etc/spark3/conf /opt/spark-3.0.0-bin-cloudera/conf
cd /opt/spark/jars
zip spark-3.0.0-jars.zip ./*
HADOOP_USER_NAME=hdfs hdfs dfs -put -f spark-3.0.0-jars.zip hdfs:///deploy/config/spark-3.0.0-jars.zip
rm spark-3.0.0-jars.zip
# Add config: spark.yarn.archive=hdfs:///deploy/config/spark-3.0.0-jars.zip
Configuration
# Enable adaptive query execution (AQE)
spark.sql.adaptive.enabled=true
# Write Parquet files in legacy format, otherwise the generated files cannot be read by Hive or other components
spark.sql.parquet.writeLegacyFormat=true
# Stay compatible with the external shuffle service used by Spark 2
spark.shuffle.useOldFetchProtocol=true
spark.sql.storeAssignmentPolicy=LEGACY
# Datasource v2 now rejects inserts where the source table and the target table are the same table; skip the check with the following parameter
#spark.sql.hive.convertInsertingPartitionedTable=false
spark.sql.sources.partitionOverwriteVerifyPath=false
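If these settings are managed per job rather than cluster-wide, they can be passed on the command line instead of spark-defaults.conf. A minimal sketch of a spark-submit invocation carrying the same configuration (the application class com.example.MyJob and jar my-job.jar are placeholders):

```shell
# Same settings as above, passed per job via --conf.
# com.example.MyJob and my-job.jar are placeholders for your application.
spark-submit \
  --master yarn \
  --conf spark.yarn.archive=hdfs:///deploy/config/spark-3.0.0-jars.zip \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.parquet.writeLegacyFormat=true \
  --conf spark.shuffle.useOldFetchProtocol=true \
  --conf spark.sql.storeAssignmentPolicy=LEGACY \
  --conf spark.sql.sources.partitionOverwriteVerifyPath=false \
  --class com.example.MyJob \
  my-job.jar
```

Per-job flags override spark-defaults.conf, which is convenient while validating the migration job by job.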
Tips
- When compiling with Maven, earlier versions allowed concurrent builds across multiple CPUs; this now causes a deadlock during compilation, so build single-threaded.
- Do not specify the package and install goals in the same Maven command, otherwise they conflict during compilation.
- Template compilation command: mvn clean install -DskipTests -Phive -Phive-thriftserver -Pyarn -Pcdhhive. The modules and targets to compile can be customized.
- Using Spark 3.0 here still requires some hacking. The yarn module needs a small change: mvn clean install -DskipTests=true -pl resource-managers/yarn -am -Phive -Phive-thriftserver -Pyarn -Pcdhhive
- After all Spark 3.0 packages are installed to the local repository, projects built on top of it can be compiled.
- Remove Spark 3.0's built-in support for newer Hive versions.
- After switching to the CDH Hive version, we found that the commons jars shaded into the Hive build are too old and had to be repackaged.
Troubleshooting
The local hive-exec dependency needs to be rebuilt: the shaded thrift classes were left out of the package, which causes:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 5, prd-zboffline-044.prd.leyantech.com, executor 1): java.lang.NoSuchMethodError: shaded.parquet.org.apache.thrift.EncodingUtils.setBit(BIZ)B
at org.apache.parquet.format.FileMetaData.setVersionIsSet(FileMetaData.java:349)
at org.apache.parquet.format.FileMetaData.setVersion(FileMetaData.java:335)
at org.apache.parquet.format.Util$DefaultFileMetaDataConsumer.setVersion(Util.java:122)
at org.apache.parquet.format.Util$5.consume(Util.java:161)
at org.apache.parquet.format.event.TypedConsumer$I32Consumer.read(TypedConsumer.java:78)
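Before rerunning the job, it can save time to confirm which jar on the classpath actually provides the shaded thrift class the error names. A diagnostic sketch, assuming the Spark distribution sits under /opt/spark-3.0.0-bin-cloudera as deployed above:

```shell
# Find which jar in Spark's jars directory contains the shaded thrift class
# shaded.parquet.org.apache.thrift.EncodingUtils from the NoSuchMethodError.
cd /opt/spark-3.0.0-bin-cloudera/jars
for j in *.jar; do
  if unzip -l "$j" 2>/dev/null | grep -q 'shaded/parquet/org/apache/thrift/EncodingUtils'; then
    echo "$j"
    # Disassemble the class to check whether setBit has the expected signature
    unzip -o "$j" 'shaded/parquet/org/apache/thrift/EncodingUtils*.class' -d /tmp/shaded-thrift >/dev/null
    javap -cp /tmp/shaded-thrift shaded.parquet.org.apache.thrift.EncodingUtils | grep setBit
  fi
done
```

If the javap output lacks the setBit overload from the stack trace, the stale shaded class is still being picked up and the hive-exec rebuild did not take effect.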