Spark 3.0 testing and use
2022-07-27 15:37:00 【wankunde】
Compatibility with Hadoop and Hive
Spark 3.0 officially supports a minimum of Hadoop 2.7 and Hive 1.2 by default. Our platform runs CDH 5.13, whose corresponding versions are hadoop-2.6.0 and hive-1.1.0, so we compile Spark 3.0 ourselves.
Build environment: Maven 3.6.3, Java 8, Scala 2.12
Precompiling Hive
Because Hive 1.1.0 is quite old, many of its dependencies are incompatible with Spark 3.0, so it needs to be recompiled.
The bundled commons-lang3 package is too old and lacks support for Java 9 and above.
Compile and install the hive-exec module (-pl ql builds only the ql module, -am also builds the modules it depends on): mvn clean package install -DskipTests -pl ql -am -Phadoop-2
Code changes :
diff --git a/pom.xml b/pom.xml
index 5d14dc4..889b960 100644
--- a/pom.xml
+++ b/pom.xml
@@ -248,6 +248,14 @@
<enabled>false</enabled>
</snapshots>
</repository>
+ <repository>
+ <id>spring</id>
+ <name>Spring repo</name>
+ <url>https://repo.spring.io/plugins-release/</url>
+ <releases>
+ <enabled>true</enabled>
+ </releases>
+ </repository>
</repositories>
<!-- Hadoop dependency management is done at the bottom under profiles -->
@@ -982,6 +990,9 @@
<profiles>
<profile>
<id>thriftif</id>
+ <properties>
+ <thrift.home>/usr/local/opt/[email protected]</thrift.home>
+ </properties>
<build>
<plugins>
<plugin>
diff --git a/ql/pom.xml b/ql/pom.xml
index 0c5e91f..101ef11 100644
--- a/ql/pom.xml
+++ b/ql/pom.xml
@@ -736,10 +736,7 @@
<include>org.apache.hive:hive-exec</include>
<include>org.apache.hive:hive-serde</include>
<include>com.esotericsoftware.kryo:kryo</include>
- <include>com.twitter:parquet-hadoop-bundle</include>
- <include>org.apache.thrift:libthrift</include>
<include>commons-lang:commons-lang</include>
- <include>org.apache.commons:commons-lang3</include>
<include>org.jodd:jodd-core</include>
<include>org.json:json</include>
<include>org.apache.avro:avro</include>
Compiling Spark
# Apache
git clone git@github.com:apache/spark.git
git checkout v3.0.0
# Leyan version, which mainly contains the Spark/Hive compatibility changes
git clone [email protected]:HDP/spark.git
git checkout -b v3.0.0_cloudera origin/v3.0.0_cloudera
./dev/make-distribution.sh --name cloudera --tgz -DskipTests -Phive -Phive-thriftserver -Pyarn -Pcdhhive
# Update the local Maven repository
mvn clean install -DskipTests=true -Phive -Phive-thriftserver -Pyarn -Pcdhhive
# deploy
rm -rf /opt/spark-3.0.0-bin-cloudera
tar -zxvf spark-3.0.0-bin-cloudera.tgz
rm -rf /opt/spark-3.0.0-bin-cloudera/conf
ln -s /etc/spark3/conf /opt/spark-3.0.0-bin-cloudera/conf
cd /opt/spark/jars
zip spark-3.0.0-jars.zip ./*
HADOOP_USER_NAME=hdfs hdfs dfs -put -f spark-3.0.0-jars.zip hdfs:///deploy/config/spark-3.0.0-jars.zip
rm spark-3.0.0-jars.zip
# Add config: spark.yarn.archive=hdfs:///deploy/config/spark-3.0.0-jars.zip
Configuration
# Enable adaptive query execution (AQE)
spark.sql.adaptive.enabled=true
# Write Parquet files in legacy format, otherwise the generated files cannot be read by Hive or other components
spark.sql.parquet.writeLegacyFormat=true
# Stay compatible with the external shuffle service used by Spark 2
spark.shuffle.useOldFetchProtocol=true
spark.sql.storeAssignmentPolicy=LEGACY
# Datasource v2 now rejects inserts where the source table and the target table are the same table; skip the check with the following parameter
#spark.sql.hive.convertInsertingPartitionedTable=false
spark.sql.sources.partitionOverwriteVerifyPath=false
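If these settings are managed per job rather than cluster-wide, they can be passed on the command line instead of spark-defaults.conf. A minimal sketch of a spark-submit invocation carrying the same configuration (the application class com.example.MyJob and jar my-job.jar are placeholders):

```shell
# Same settings as above, passed per job via --conf.
# com.example.MyJob and my-job.jar are placeholders for your application.
spark-submit \
  --master yarn \
  --conf spark.yarn.archive=hdfs:///deploy/config/spark-3.0.0-jars.zip \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.parquet.writeLegacyFormat=true \
  --conf spark.shuffle.useOldFetchProtocol=true \
  --conf spark.sql.storeAssignmentPolicy=LEGACY \
  --conf spark.sql.sources.partitionOverwriteVerifyPath=false \
  --class com.example.MyJob \
  my-job.jar
```

Per-job flags override spark-defaults.conf, which is convenient while validating the migration job by job.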
Tips
- When compiling with Maven, earlier versions allowed concurrent builds across multiple CPUs; this now causes a deadlock during compilation, so build single-threaded.
- Do not specify the package and install goals in the same Maven command, otherwise they conflict during compilation.
- Template compilation command: mvn clean install -DskipTests -Phive -Phive-thriftserver -Pyarn -Pcdhhive. The modules and targets to compile can be customized.
- Using Spark 3.0 here still requires some hacking. The yarn module needs a small change: mvn clean install -DskipTests=true -pl resource-managers/yarn -am -Phive -Phive-thriftserver -Pyarn -Pcdhhive
- After all Spark 3.0 packages are installed to the local repository, projects built on top of it can be compiled.
- Remove Spark 3.0's built-in support for newer Hive versions.
- After switching to the CDH Hive version, we found that the commons jars shaded into the Hive build are too old and had to be repackaged.
Troubleshooting
The local hive-exec dependency needs to be rebuilt: the shaded thrift classes were left out of the package, which causes:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 5, prd-zboffline-044.prd.leyantech.com, executor 1): java.lang.NoSuchMethodError: shaded.parquet.org.apache.thrift.EncodingUtils.setBit(BIZ)B
at org.apache.parquet.format.FileMetaData.setVersionIsSet(FileMetaData.java:349)
at org.apache.parquet.format.FileMetaData.setVersion(FileMetaData.java:335)
at org.apache.parquet.format.Util$DefaultFileMetaDataConsumer.setVersion(Util.java:122)
at org.apache.parquet.format.Util$5.consume(Util.java:161)
at org.apache.parquet.format.event.TypedConsumer$I32Consumer.read(TypedConsumer.java:78)
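Before rerunning the job, it can save time to confirm which jar on the classpath actually provides the shaded thrift class the error names. A diagnostic sketch, assuming the Spark distribution sits under /opt/spark-3.0.0-bin-cloudera as deployed above:

```shell
# Find which jar in Spark's jars directory contains the shaded thrift class
# shaded.parquet.org.apache.thrift.EncodingUtils from the NoSuchMethodError.
cd /opt/spark-3.0.0-bin-cloudera/jars
for j in *.jar; do
  if unzip -l "$j" 2>/dev/null | grep -q 'shaded/parquet/org/apache/thrift/EncodingUtils'; then
    echo "$j"
    # Disassemble the class to check whether setBit has the expected signature
    unzip -o "$j" 'shaded/parquet/org/apache/thrift/EncodingUtils*.class' -d /tmp/shaded-thrift >/dev/null
    javap -cp /tmp/shaded-thrift shaded.parquet.org.apache.thrift.EncodingUtils | grep setBit
  fi
done
```

If the javap output lacks the setBit overload from the stack trace, the stale shaded class is still being picked up and the hive-exec rebuild did not take effect.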