Spark 3.0 Testing and Usage
2022-07-27 14:23:00 【wankunde】
Compatibility with Hadoop and Hive
The minimum versions Spark 3.0 officially supports by default are Hadoop 2.7 and Hive 1.2. Our platform runs CDH 5.13, whose corresponding versions are hadoop-2.6.0 and hive-1.1.0, so we compile Spark 3.0 ourselves.
Build environment: Maven 3.6.3, Java 8, Scala 2.12
Pre-building Hive
Hive 1.1.0 is very old, and many of its dependencies are incompatible with those in Spark 3.0, so it needs to be recompiled.
The commons-lang3 version it ships is too old and lacks support for Java 9 and above.
Building the hive-exec module: mvn clean install -DskipTests -pl ql -am -Phadoop-2
Code changes:
diff --git a/pom.xml b/pom.xml
index 5d14dc4..889b960 100644
--- a/pom.xml
+++ b/pom.xml
@@ -248,6 +248,14 @@
         <enabled>false</enabled>
       </snapshots>
     </repository>
+    <repository>
+      <id>spring</id>
+      <name>Spring repo</name>
+      <url>https://repo.spring.io/plugins-release/</url>
+      <releases>
+        <enabled>true</enabled>
+      </releases>
+    </repository>
   </repositories>

   <!-- Hadoop dependency management is done at the bottom under profiles -->
@@ -982,6 +990,9 @@
   <profiles>
     <profile>
       <id>thriftif</id>
+      <properties>
+        <thrift.home>/usr/local/opt/[email protected]</thrift.home>
+      </properties>
       <build>
         <plugins>
           <plugin>
diff --git a/ql/pom.xml b/ql/pom.xml
index 0c5e91f..101ef11 100644
--- a/ql/pom.xml
+++ b/ql/pom.xml
@@ -736,10 +736,7 @@
                   <include>org.apache.hive:hive-exec</include>
                   <include>org.apache.hive:hive-serde</include>
                   <include>com.esotericsoftware.kryo:kryo</include>
-                  <include>com.twitter:parquet-hadoop-bundle</include>
-                  <include>org.apache.thrift:libthrift</include>
                   <include>commons-lang:commons-lang</include>
-                  <include>org.apache.commons:commons-lang3</include>
                   <include>org.jodd:jodd-core</include>
                   <include>org.json:json</include>
                   <include>org.apache.avro:avro</include>
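After rebuilding and installing, it is worth confirming that the thrift and commons-lang3 classes really are gone from the shaded hive-exec jar. A minimal check, assuming the jar lands under ql/target/ and the version is 1.1.0 (adjust the path for your build):
# Sketch: verify the rebuilt hive-exec jar no longer shades thrift/commons-lang3
unzip -l ql/target/hive-exec-1.1.0.jar \
  | grep -E 'org/apache/thrift|org/apache/commons/lang3' \
  || echo "OK: thrift and commons-lang3 are no longer shaded"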
Building Spark
# Apache
git clone git@github.com:apache/spark.git
git checkout v3.0.0
# Leyan version, which mainly contains the Spark-Hive compatibility changes
git clone [email protected]:HDP/spark.git
git checkout -b v3.0.0_cloudera origin/v3.0.0_cloudera
./dev/make-distribution.sh --name cloudera --tgz -DskipTests -Phive -Phive-thriftserver -Pyarn -Pcdhhive
# Update the local Maven repository
mvn clean install -DskipTests=true -Phive -Phive-thriftserver -Pyarn -Pcdhhive
# deploy
rm -rf /opt/spark-3.0.0-bin-cloudera
tar -zxvf spark-3.0.0-bin-cloudera.tgz -C /opt
rm -rf /opt/spark-3.0.0-bin-cloudera/conf
ln -s /etc/spark3/conf /opt/spark-3.0.0-bin-cloudera/conf
cd /opt/spark-3.0.0-bin-cloudera/jars
zip spark-3.0.0-jars.zip ./*
HADOOP_USER_NAME=hdfs hdfs dfs -put -f spark-3.0.0-jars.zip hdfs:///deploy/config/spark-3.0.0-jars.zip
rm spark-3.0.0-jars.zip
# add config : spark.yarn.archive=hdfs:///deploy/config/spark-3.0.0-jars.zip
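With the tarball deployed and spark.yarn.archive pointing at the uploaded zip, a quick way to smoke-test the build is the bundled SparkPi example. A minimal sketch; cluster-specific queue and resource flags are omitted, and the examples jar name may differ for a custom build:
# Sketch: smoke-test the new build on YARN with the bundled SparkPi example
/opt/spark-3.0.0-bin-cloudera/bin/spark-submit \
  --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark-3.0.0-bin-cloudera/examples/jars/spark-examples_2.12-3.0.0.jar 100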
Configuration
# Enable adaptive query execution (AQE)
spark.sql.adaptive.enabled=true
# Have Spark write Parquet files in legacy format; otherwise the files cannot be read by Hive or other components
spark.sql.parquet.writeLegacyFormat=true
# Stay compatible with the Spark 2 external shuffle service
spark.shuffle.useOldFetchProtocol=true
spark.sql.storeAssignmentPolicy=LEGACY
# By default, DataSource V2 no longer allows the source table and the target table of a query to be the same table; the parameter below skips this check
#spark.sql.hive.convertInsertingPartitionedTable=false
spark.sql.sources.partitionOverwriteVerifyPath=false
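These settings normally go into spark-defaults.conf; they can also be supplied per job via --conf, for example (reusing the SparkPi smoke test from above):
# Sketch: passing the same settings per job instead of via spark-defaults.conf
/opt/spark-3.0.0-bin-cloudera/bin/spark-submit \
  --master yarn --deploy-mode cluster \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.sql.parquet.writeLegacyFormat=true \
  --conf spark.shuffle.useOldFetchProtocol=true \
  --conf spark.sql.storeAssignmentPolicy=LEGACY \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark-3.0.0-bin-cloudera/examples/jars/spark-examples_2.12-3.0.0.jar 100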
Tips
- Earlier versions could be compiled with Maven in parallel across multiple CPUs; that no longer works, and a parallel build now deadlocks.
- Do not specify both the package and install goals in the same Maven command, or the build will conflict.
- Template build command: mvn clean install -DskipTests -Phive -Phive-thriftserver -Pyarn -Pcdhhive; the modules and build targets can be customized.
- Using Spark 3.0 still requires patching. The yarn module needs minor changes: mvn clean install -DskipTests=true -pl resource-managers/yarn -am -Phive -Phive-thriftserver -Pyarn -Pcdhhive
- After all Spark 3.0 artifacts are installed locally, the above-board project can be built.
- Removed Spark 3.0's support for higher Hive versions.
- When switching to the CDH Hive version, we found that the commons jars it shades are too old, so it was repackaged.
Troubleshooting
The dependencies of the local hive-exec jar need to be updated, removing the thrift classes from shading when repackaging (see the ql/pom.xml diff above). Without this, jobs fail with:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 5, prd-zboffline-044.prd.leyantech.com, executor 1): java.lang.NoSuchMethodError: shaded.parquet.org.apache.thrift.EncodingUtils.setBit(BIZ)B
at org.apache.parquet.format.FileMetaData.setVersionIsSet(FileMetaData.java:349)
at org.apache.parquet.format.FileMetaData.setVersion(FileMetaData.java:335)
at org.apache.parquet.format.Util$DefaultFileMetaDataConsumer.setVersion(Util.java:122)
at org.apache.parquet.format.Util$5.consume(Util.java:161)
at org.apache.parquet.format.event.TypedConsumer$I32Consumer.read(TypedConsumer.java:78)
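If the error persists, it helps to find which jar on the classpath still bundles the shaded thrift class. A minimal sketch, assuming the jars live in the deployed distribution directory:
# Sketch: locate jars that still ship the shaded thrift EncodingUtils class
for j in /opt/spark-3.0.0-bin-cloudera/jars/*.jar; do
  unzip -l "$j" 2>/dev/null | grep -q 'shaded/parquet/org/apache/thrift/EncodingUtils' && echo "$j"
done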