Spark 3.0 testing and use
2022-07-27 15:37:00 【wankunde】
Compatibility with Hadoop and Hive
By default, Spark 3.0 officially supports Hadoop 2.7 and Hive 1.2 as minimum versions. Our platform runs CDH 5.13, whose corresponding versions are hadoop-2.6.0 and hive-1.1.0, so we compile Spark 3.0 ourselves.
Build environment: Maven 3.6.3, Java 8, Scala 2.12
Precompiling Hive
Hive 1.1.0 is very old, and many of its dependencies are incompatible with Spark 3.0, so it has to be recompiled.
In particular, the bundled commons-lang3 is too old and lacks support for Java 9 and above.
Compile the hive-exec module (note: per the Tips below, `package` and `install` must not be combined in one invocation):

```bash
mvn clean install -DskipTests -pl ql -am -Phadoop-2
```

Code changes:
```diff
diff --git a/pom.xml b/pom.xml
index 5d14dc4..889b960 100644
--- a/pom.xml
+++ b/pom.xml
@@ -248,6 +248,14 @@
         <enabled>false</enabled>
       </snapshots>
     </repository>
+    <repository>
+      <id>spring</id>
+      <name>Spring repo</name>
+      <url>https://repo.spring.io/plugins-release/</url>
+      <releases>
+        <enabled>true</enabled>
+      </releases>
+    </repository>
   </repositories>

   <!-- Hadoop dependency management is done at the bottom under profiles -->
@@ -982,6 +990,9 @@
   <profiles>
     <profile>
       <id>thriftif</id>
+      <properties>
+        <thrift.home>/usr/local/opt/thrift@0.9</thrift.home>
+      </properties>
       <build>
         <plugins>
           <plugin>
diff --git a/ql/pom.xml b/ql/pom.xml
index 0c5e91f..101ef11 100644
--- a/ql/pom.xml
+++ b/ql/pom.xml
@@ -736,10 +736,7 @@
                   <include>org.apache.hive:hive-exec</include>
                   <include>org.apache.hive:hive-serde</include>
                   <include>com.esotericsoftware.kryo:kryo</include>
-                  <include>com.twitter:parquet-hadoop-bundle</include>
-                  <include>org.apache.thrift:libthrift</include>
                   <include>commons-lang:commons-lang</include>
-                  <include>org.apache.commons:commons-lang3</include>
                   <include>org.jodd:jodd-core</include>
                   <include>org.json:json</include>
                   <include>org.apache.avro:avro</include>
```
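After the rebuild, it is worth verifying that the dependencies removed from the shade configuration really are gone from the jar. A minimal sketch, assuming the default `ql/target` output path of the hive-exec module; adjust the jar name to your Hive version:

```bash
# The rebuilt hive-exec jar should no longer bundle libthrift or commons-lang3 classes
jar tf ql/target/hive-exec-1.1.0.jar \
  | grep -E 'org/apache/thrift|org/apache/commons/lang3' \
  || echo "OK: no longer bundled"
```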
Compiling Spark

```bash
# Apache upstream
git clone git@github.com:apache/spark.git
git checkout v3.0.0

# Leyan version, which mainly carries the Spark/Hive compatibility changes
git clone [email protected]:HDP/spark.git
git checkout -b v3.0.0_cloudera origin/v3.0.0_cloudera
./dev/make-distribution.sh --name cloudera --tgz -DskipTests -Phive -Phive-thriftserver -Pyarn -Pcdhhive

# Update the local Maven repository
mvn clean install -DskipTests=true -Phive -Phive-thriftserver -Pyarn -Pcdhhive

# Deploy
rm -rf /opt/spark-3.0.0-bin-cloudera
tar -zxvf spark-3.0.0-bin-cloudera.tgz
rm -rf /opt/spark-3.0.0-bin-cloudera/conf
ln -s /etc/spark3/conf /opt/spark-3.0.0-bin-cloudera/conf

cd /opt/spark/jars
zip spark-3.0.0-jars.zip ./*
HADOOP_USER_NAME=hdfs hdfs dfs -put -f spark-3.0.0-jars.zip hdfs:///deploy/config/spark-3.0.0-jars.zip
rm spark-3.0.0-jars.zip
# Add the config: spark.yarn.archive=hdfs:///deploy/config/spark-3.0.0-jars.zip
```
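A quick smoke test after deployment confirms the build actually runs on YARN. A minimal sketch, assuming the install path above and the example jar that ships in a default 3.0.0 distribution:

```bash
# Run the bundled SparkPi example on YARN as a sanity check
/opt/spark-3.0.0-bin-cloudera/bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark-3.0.0-bin-cloudera/examples/jars/spark-examples_2.12-3.0.0.jar 100
```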
Configuration
```properties
# Enable adaptive execution (AE)
spark.sql.adaptive.enabled=true

# Write Parquet files in legacy format; otherwise Hive and other components cannot read the generated files
spark.sql.parquet.writeLegacyFormat=true

# Stay compatible with the external shuffle service deployed for Spark 2
spark.shuffle.useOldFetchProtocol=true

spark.sql.storeAssignmentPolicy=LEGACY

# Datasource v2 currently does not allow an insert whose source table and target table are the same table; skip that check with the parameter below
#spark.sql.hive.convertInsertingPartitionedTable=false
spark.sql.sources.partitionOverwriteVerifyPath=false
```
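To confirm these settings are actually picked up at runtime, you can echo them back from a session. A small sketch using `spark-sql` (the binary path assumes the deploy location above); `SET <key>` with no value prints the current value:

```bash
/opt/spark-3.0.0-bin-cloudera/bin/spark-sql \
  -e "SET spark.sql.adaptive.enabled; SET spark.shuffle.useOldFetchProtocol;"
```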
Tips
- When compiling with Maven, earlier versions could use multiple CPUs for concurrent compilation; that is no longer supported, and a parallel build now deadlocks.
- A single Maven command must not specify both `package` and `install`, otherwise the goals conflict during compilation (see the sketch after this list).
- Template compile command: `mvn clean install -DskipTests -Phive -Phive-thriftserver -Pyarn -Pcdhhive`; the modules and build targets can be customized as needed.
- Using Spark 3.0 still requires some hand patching. The yarn module needs a small change: `mvn clean install -DskipTests=true -pl resource-managers/yarn -am -Phive -Phive-thriftserver -Pyarn -Pcdhhive`. Once all Spark 3.0 packages are installed locally, the upper-layer projects can be built on top of them.
- Remove Spark 3.0's built-in support for the newer Hive versions.
- After switching to the CDH Hive version, we found that the commons jars shaded into that Hive build are too old, so it has to be repackaged.
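To illustrate the first two tips, a hedged sketch of a safe build sequence: run Maven serially (no `-T`/`--threads` parallelism) and split `package` and `install` into separate invocations:

```bash
# Build serially; parallel builds (-T) deadlock on this codebase
mvn clean package -DskipTests -Phive -Phive-thriftserver -Pyarn -Pcdhhive
# Install to the local repository in a separate Maven run
mvn install -DskipTests -Phive -Phive-thriftserver -Pyarn -Pcdhhive
```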
Troubleshooting
The local hive-exec dependency needs to be updated: when it was packaged, the code for the shaded thrift package was left out, which produces the following error:
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 5, prd-zboffline-044.prd.leyantech.com, executor 1): java.lang.NoSuchMethodError: shaded.parquet.org.apache.thrift.EncodingUtils.setBit(BIZ)B
    at org.apache.parquet.format.FileMetaData.setVersionIsSet(FileMetaData.java:349)
    at org.apache.parquet.format.FileMetaData.setVersion(FileMetaData.java:335)
    at org.apache.parquet.format.Util$DefaultFileMetaDataConsumer.setVersion(Util.java:122)
    at org.apache.parquet.format.Util$5.consume(Util.java:161)
    at org.apache.parquet.format.event.TypedConsumer$I32Consumer.read(TypedConsumer.java:78)
```
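One way to confirm the fix is to check that the shaded thrift classes are now present in the hive-exec jar that Spark picks up. A sketch, where the repository path and jar version are assumptions to adjust for your local setup:

```bash
# The shaded EncodingUtils class must be present, or the NoSuchMethodError above recurs
unzip -l ~/.m2/repository/org/apache/hive/hive-exec/1.1.0/hive-exec-1.1.0.jar \
  | grep 'shaded/parquet/org/apache/thrift/EncodingUtils'
```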