Spark 3.0 testing and use
2022-07-27 15:37:00 【wankunde】
Compatibility with Hadoop and Hive
Spark 3.0 officially supports Hadoop 2.7 and Hive 1.2 as the default minimum versions. Our platform runs CDH 5.13, whose corresponding versions are hadoop-2.6.0 and hive-1.1.0, so we have to compile Spark 3.0 ourselves in order to use it.
Build environment: Maven 3.6.3, Java 8, Scala 2.12
Precompiling the Hive version
Hive 1.1.0 is quite old and many of its dependencies are incompatible with Spark 3.0, so it needs to be recompiled.
The bundled commons-lang3 version is too old and lacks support for Java 9 and above.
Compile the hive-exec module: mvn clean package install -DskipTests -pl ql -am -Phadoop-2
Code changes:
diff --git a/pom.xml b/pom.xml
index 5d14dc4..889b960 100644
--- a/pom.xml
+++ b/pom.xml
@@ -248,6 +248,14 @@
<enabled>false</enabled>
</snapshots>
</repository>
+ <repository>
+ <id>spring</id>
+ <name>Spring repo</name>
+ <url>https://repo.spring.io/plugins-release/</url>
+ <releases>
+ <enabled>true</enabled>
+ </releases>
+ </repository>
</repositories>
<!-- Hadoop dependency management is done at the bottom under profiles -->
@@ -982,6 +990,9 @@
<profiles>
<profile>
<id>thriftif</id>
+ <properties>
+ <thrift.home>/usr/local/opt/[email protected]</thrift.home>
+ </properties>
<build>
<plugins>
<plugin>
diff --git a/ql/pom.xml b/ql/pom.xml
index 0c5e91f..101ef11 100644
--- a/ql/pom.xml
+++ b/ql/pom.xml
@@ -736,10 +736,7 @@
<include>org.apache.hive:hive-exec</include>
<include>org.apache.hive:hive-serde</include>
<include>com.esotericsoftware.kryo:kryo</include>
- <include>com.twitter:parquet-hadoop-bundle</include>
- <include>org.apache.thrift:libthrift</include>
<include>commons-lang:commons-lang</include>
- <include>org.apache.commons:commons-lang3</include>
<include>org.jodd:jodd-core</include>
<include>org.json:json</include>
<include>org.apache.avro:avro</include>
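Not part of the original notes, but a quick sanity check after the rebuild: list the contents of the freshly built hive-exec jar and confirm that the artifacts removed from the shade include list are no longer bundled. The jar path and version below are assumptions; adjust them to your build output.
# Path and version are assumptions; point this at the hive-exec jar produced by the build above
HIVE_EXEC_JAR=ql/target/hive-exec-1.1.0.jar
# Both greps should print nothing once libthrift and commons-lang3 are no longer shaded in
jar tf "$HIVE_EXEC_JAR" | grep 'org/apache/commons/lang3/'
jar tf "$HIVE_EXEC_JAR" | grep 'org/apache/thrift/'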
Compiling Spark
# Apache
git clone git@github.com:apache/spark.git
git checkout v3.0.0
# Leyan version, mainly involving the spark/hive compatibility changes
git clone [email protected]:HDP/spark.git
git checkout -b v3.0.0_cloudera origin/v3.0.0_cloudera
./dev/make-distribution.sh --name cloudera --tgz -DskipTests -Phive -Phive-thriftserver -Pyarn -Pcdhhive
# Update the local Maven repository
mvn clean install -DskipTests=true -Phive -Phive-thriftserver -Pyarn -Pcdhhive
# deploy
rm -rf /opt/spark-3.0.0-bin-cloudera
tar -zxvf spark-3.0.0-bin-cloudera.tgz
rm -rf /opt/spark-3.0.0-bin-cloudera/conf
ln -s /etc/spark3/conf /opt/spark-3.0.0-bin-cloudera/conf
cd /opt/spark/jars
zip spark-3.0.0-jars.zip ./*
HADOOP_USER_NAME=hdfs hdfs dfs -put -f spark-3.0.0-jars.zip hdfs:///deploy/config/spark-3.0.0-jars.zip
rm spark-3.0.0-jars.zip
# Add config: spark.yarn.archive=hdfs:///deploy/config/spark-3.0.0-jars.zip
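The original steps stop at uploading the jar archive. As a hedged smoke test (not from the post), something like the following can confirm that the deployed build launches on YARN and resolves the uploaded archive; the example class and jar name follow the stock Spark 3.0.0 layout and may differ in a custom build.
/opt/spark-3.0.0-bin-cloudera/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.archive=hdfs:///deploy/config/spark-3.0.0-jars.zip \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark-3.0.0-bin-cloudera/examples/jars/spark-examples_2.12-3.0.0.jar 100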
Configuration
# Enable adaptive execution (AQE)
spark.sql.adaptive.enabled=true
# Write parquet files in legacy mode, otherwise the generated files cannot be read by Hive or other components
spark.sql.parquet.writeLegacyFormat=true
# Stay compatible with the external shuffle service from Spark 2
spark.shuffle.useOldFetchProtocol=true
spark.sql.storeAssignmentPolicy=LEGACY
# By default, datasource v2 no longer allows the source table and the target table of an insert to be the same table; skip that check with the following parameters
#spark.sql.hive.convertInsertingPartitionedTable=false
spark.sql.sources.partitionOverwriteVerifyPath=false
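A hedged way to confirm these settings are actually picked up, assuming they live in spark-defaults.conf under the linked conf directory (the binary path is an assumption): spark-sql echoes back the effective value for each key.
/opt/spark-3.0.0-bin-cloudera/bin/spark-sql -e "
SET spark.sql.adaptive.enabled;
SET spark.sql.parquet.writeLegacyFormat;
SET spark.shuffle.useOldFetchProtocol;
"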
Tips
- When compiling with Maven, earlier versions allowed concurrent compilation across many CPUs; that no longer works and will deadlock the build.
- A single maven invocation must not specify both package and install, otherwise the two goals conflict during compilation.
- Template compile command: mvn clean install -DskipTests -Phive -Phive-thriftserver -Pyarn -Pcdhhive; the modules and compile targets can be customized.
- Using Spark 3.0 still requires some hacking: the yarn module needs a small change. mvn clean install -DskipTests=true -pl resource-managers/yarn -am -Phive -Phive-thriftserver -Pyarn -Pcdhhive
- After all Spark 3.0 packages are installed locally, you can go on to compile downstream projects against them.
- Remove Spark 3.0's built-in support for the newer Hive version.
- After switching to the CDH Hive version, we found that the commons jar shaded into that Hive build is too old, so it has to be repackaged (a quick check of the assembled jars directory is sketched after this list).
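Not in the original notes, but a hedged check related to the last two tips: look at which Hive and commons-lang3 artifacts actually end up in the assembled distribution. The directory and jar names are assumptions about the layout produced by make-distribution.sh.
# Expect the CDH 1.1.0 hive artifacts rather than the stock 2.3.7 ones
ls /opt/spark-3.0.0-bin-cloudera/jars/ | grep -E 'hive-(exec|metastore)'
# Expect a single, recent commons-lang3 jar
ls /opt/spark-3.0.0-bin-cloudera/jars/ | grep commons-lang3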
Troubleshooting
The local hive-exec package dependency needs to be updated: the previously installed build was missing the shaded thrift classes, which produces the error below.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 5, prd-zboffline-044.prd.leyantech.com, executor 1): java.lang.NoSuchMethodError: shaded.parquet.org.apache.thrift.EncodingUtils.setBit(BIZ)B
at org.apache.parquet.format.FileMetaData.setVersionIsSet(FileMetaData.java:349)
at org.apache.parquet.format.FileMetaData.setVersion(FileMetaData.java:335)
at org.apache.parquet.format.Util$DefaultFileMetaDataConsumer.setVersion(Util.java:122)
at org.apache.parquet.format.Util$5.consume(Util.java:161)
at org.apache.parquet.format.event.TypedConsumer$I32Consumer.read(TypedConsumer.java:78)
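A hedged way to track down this kind of NoSuchMethodError (not covered in the original notes) is to find which jars in the deployed distribution carry the relocated thrift class named in the stack trace; the directory below is an assumption.
# Directory is an assumption; point it at the deployed Spark jars
cd /opt/spark-3.0.0-bin-cloudera/jars
for j in *.jar; do
  if unzip -l "$j" 2>/dev/null | grep -q 'shaded/parquet/org/apache/thrift/EncodingUtils.class'; then
    echo "$j"
  fi
done
# A hit only in an outdated hive-exec jar (or hits in several jars) points at the stale shaded copy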