Spark 3.0 Testing and Usage
2022-07-27 15:37:00 【wankunde】
Compatibility with Hadoop and Hive
Spark 3.0 officially supports Hadoop 2.7 and Hive 1.2 as the minimum versions by default. Our platform runs CDH 5.13, whose corresponding versions are hadoop-2.6.0 and hive-1.1.0, so we have to compile Spark 3.0 ourselves.
Build environment: Maven 3.6.3, Java 8, Scala 2.12
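Before starting, it may help to confirm the toolchain matches. A quick sanity check (assuming the tools are already on the PATH):
# verify the build toolchain
mvn -version    # expect Maven 3.6.3 running on Java 8
java -version   # expect 1.8.x
# Scala 2.12 is pulled in by the Maven build itself; no standalone install is needed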
Pre-compiling the Hive version
Because Hive 1.1.0 is very old, many of its dependencies are incompatible with Spark 3.0, so it needs to be recompiled.
The commons-lang3 package version is too old and lacks support for JAVA_9 and above.
Build the hive-exec module: mvn clean package install -DskipTests -pl ql -am -Phadoop-2
Code changes:
diff --git a/pom.xml b/pom.xml
index 5d14dc4..889b960 100644
--- a/pom.xml
+++ b/pom.xml
@@ -248,6 +248,14 @@
<enabled>false</enabled>
</snapshots>
</repository>
+ <repository>
+ <id>spring</id>
+ <name>Spring repo</name>
+ <url>https://repo.spring.io/plugins-release/</url>
+ <releases>
+ <enabled>true</enabled>
+ </releases>
+ </repository>
</repositories>
<!-- Hadoop dependency management is done at the bottom under profiles -->
@@ -982,6 +990,9 @@
<profiles>
<profile>
<id>thriftif</id>
+ <properties>
+ <thrift.home>/usr/local/opt/[email protected]</thrift.home>
+ </properties>
<build>
<plugins>
<plugin>
diff --git a/ql/pom.xml b/ql/pom.xml
index 0c5e91f..101ef11 100644
--- a/ql/pom.xml
+++ b/ql/pom.xml
@@ -736,10 +736,7 @@
<include>org.apache.hive:hive-exec</include>
<include>org.apache.hive:hive-serde</include>
<include>com.esotericsoftware.kryo:kryo</include>
- <include>com.twitter:parquet-hadoop-bundle</include>
- <include>org.apache.thrift:libthrift</include>
<include>commons-lang:commons-lang</include>
- <include>org.apache.commons:commons-lang3</include>
<include>org.jodd:jodd-core</include>
<include>org.json:json</include>
<include>org.apache.avro:avro</include>
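Before wiring the rebuilt jar into Spark, it is worth confirming that the shade changes above actually took effect. A minimal check (the jar path assumes the default ~/.m2 layout for the locally installed Hive 1.1.0 build):
# the rebuilt hive-exec should no longer bundle the old commons-lang3 copies
jar tf ~/.m2/repository/org/apache/hive/hive-exec/1.1.0/hive-exec-1.1.0.jar | grep -c 'org/apache/commons/lang3'
# expect 0 matches after the change above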
Compiling Spark
# Apache
git clone git@github.com:apache/spark.git
git checkout v3.0.0
# Leyan version: mainly contains the Spark/Hive compatibility changes
git clone [email protected]:HDP/spark.git
git checkout -b v3.0.0_cloudera origin/v3.0.0_cloudera
./dev/make-distribution.sh --name cloudera --tgz -DskipTests -Phive -Phive-thriftserver -Pyarn -Pcdhhive
# Update the local Maven repository
mvn clean install -DskipTests=true -Phive -Phive-thriftserver -Pyarn -Pcdhhive
# deploy
rm -rf /opt/spark-3.0.0-bin-cloudera
tar -zxvf spark-3.0.0-bin-cloudera.tgz
rm -rf /opt/spark-3.0.0-bin-cloudera/conf
ln -s /etc/spark3/conf /opt/spark-3.0.0-bin-cloudera/conf
cd /opt/spark/jars
zip spark-3.0.0-jars.zip ./*
HADOOP_USER_NAME=hdfs hdfs dfs -put -f spark-3.0.0-jars.zip hdfs:///deploy/config/spark-3.0.0-jars.zip
rm spark-3.0.0-jars.zip
# add config: spark.yarn.archive=hdfs:///deploy/config/spark-3.0.0-jars.zip
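Once the archive and config are in place, jobs launched through spark-submit ship the jars from HDFS instead of uploading them each time. A minimal submission sketch (the application class and jar path are hypothetical placeholders):
# hypothetical submission using the jar archive from HDFS
# (com.example.MyApp and my-app.jar are placeholders)
/opt/spark-3.0.0-bin-cloudera/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.archive=hdfs:///deploy/config/spark-3.0.0-jars.zip \
  --class com.example.MyApp \
  /path/to/my-app.jar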
Configuration
# Enable adaptive execution (AE) mode
spark.sql.adaptive.enabled=true
# Write parquet files in legacy mode; otherwise the generated files cannot be read by Hive or other components
spark.sql.parquet.writeLegacyFormat=true
# Keep compatibility with the Spark 2 external shuffle service
spark.shuffle.useOldFetchProtocol=true
spark.sql.storeAssignmentPolicy=LEGACY
# The default datasource v2 currently does not support the source table and the target table being the same table; skip the check with the following parameter
#spark.sql.hive.convertInsertingPartitionedTable=false
spark.sql.sources.partitionOverwriteVerifyPath=false
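These settings can live in spark-defaults.conf or be passed per job. As a quick sanity check (a sketch, not part of the original deployment), the values can be confirmed from spark-shell:
# pass the settings on the command line and confirm they are picked up
/opt/spark-3.0.0-bin-cloudera/bin/spark-shell \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.parquet.writeLegacyFormat=true \
  --conf spark.shuffle.useOldFetchProtocol=true
# inside the shell: spark.conf.get("spark.sql.adaptive.enabled") should return "true"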
Tips
- When compiling with Maven, earlier versions supported concurrent compilation across multiple CPUs; this no longer works and causes deadlocks during compilation.
- A maven build command cannot specify package and install at the same time, otherwise they conflict during compilation.
- Template compilation command: mvn clean install -DskipTests -Phive -Phive-thriftserver -Pyarn -Pcdhhive; the modules and build targets can be customized.
- To use Spark 3.0, patching is still required. The yarn module needs a small change: mvn clean install -DskipTests=true -pl resource-managers/yarn -am -Phive -Phive-thriftserver -Pyarn -Pcdhhive
- After all Spark 3.0 packages are installed locally, upper-layer projects can be built on top of them.
- Remove Spark 3.0's support for higher Hive versions.
- After switching to the CDH Hive version, we found that the commons jars shaded into that Hive build are too old and need to be repackaged.
Troubleshooting
The local hive-exec package dependency needs to be updated: the shaded thrift classes were missing when the jar was packaged, which causes the following failure:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 5, prd-zboffline-044.prd.leyantech.com, executor 1): java.lang.NoSuchMethodError: shaded.parquet.org.apache.thrift.EncodingUtils.setBit(BIZ)B
at org.apache.parquet.format.FileMetaData.setVersionIsSet(FileMetaData.java:349)
at org.apache.parquet.format.FileMetaData.setVersion(FileMetaData.java:335)
at org.apache.parquet.format.Util$DefaultFileMetaDataConsumer.setVersion(Util.java:122)
at org.apache.parquet.format.Util$5.consume(Util.java:161)
at org.apache.parquet.format.event.TypedConsumer$I32Consumer.read(TypedConsumer.java:78)
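To confirm the fix, check whether the shaded thrift classes are actually present in the rebuilt hive-exec jar (again assuming the default ~/.m2 layout; the class below is the one from the stack trace):
# the NoSuchMethodError above points at shaded.parquet.org.apache.thrift.EncodingUtils
jar tf ~/.m2/repository/org/apache/hive/hive-exec/1.1.0/hive-exec-1.1.0.jar | grep 'shaded/parquet/org/apache/thrift/EncodingUtils'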