Spark Project Packaging Optimization in Practice
2022-06-24 06:39:00 [Angryshark_128]
Problem Description
When developing Spark projects in Scala/Java, you routinely build and package the project and upload it to a server. Since the Spark base libraries the project depends on are generally large, every repackaging costs a great deal of time once remote development and debugging are involved, so this process is worth optimizing.
Optimization Approaches
Approach 1: upload the full jar once, then incrementally update class files
POM configuration (Maven)
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    ........
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.4</version>
        <scope>test</scope>
    </dependency>
</dependencies>
<!-- build configuration -->
<build>
    <resources>
        <resource>
            <directory>src/main/resources</directory>
        </resource>
    </resources>
    <plugins>
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.2.2</version>
            <configuration>
                <recompileMode>incremental</recompileMode>
            </configuration>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-assembly-plugin</artifactId>
            <version>2.4.1</version>
            <configuration>
                <!-- get all project dependencies -->
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <!-- bind to the packaging phase -->
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
Packaging with the configuration above produces two jars, *-1.0-SNAPSHOT.jar and *-1.0-SNAPSHOT-jar-with-dependencies.jar. The latter is a standalone executable jar, but because many unused dependencies get packed into it, even a very simple project comes out at one or two hundred MB.
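For example, building and inspecting the output (a minimal sketch; the artifact names follow the sparktest example below):

# run the full build; both jars land in target/
mvn clean package
# compare the two artifacts
ls -lh target/*.jar
# sparktest-1.0-SNAPSHOT.jar                        -> project classes only, small
# sparktest-1.0-SNAPSHOT-jar-with-dependencies.jar  -> everything bundled, huge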
How it works: a jar is really just an ordinary ZIP archive. Unpack one and you find the project's dependencies alongside your own compiled class files, static resource files, and so on. In other words, each time we modify code and repackage, only a few class or static resource files actually change, so subsequent updates only need to replace the recompiled class files.
Example:
Take a simple sparktest project: after packaging, you get sparktest-1.0-SNAPSHOT.jar and sparktest-1.0-SNAPSHOT-jar-with-dependencies.jar.

sparktest-1.0-SNAPSHOT-jar-with-dependencies.jar is the standalone executable jar; upload it to the server and it can be run directly. Open the jar with any unzip tool to see its directory structure.
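Since the jar is a ZIP archive, you can also list its contents from the command line (a minimal sketch):

# list archive entries; unzip -l <jar> works just as well
jar tf sparktest-1.0-SNAPSHOT-jar-with-dependencies.jar | head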

The App*.class files inside are the compiled output of the main code.

After modifying App.scala, simply recompile, and the new App*.class files appear under target/classes.
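Concretely (a sketch; com/example stands in for the project's actual package path, matching the jar uvf example below):

# recompile only, no repackaging needed
mvn compile
# the freshly compiled class files
ls target/classes/com/example/App*.class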

Upload the updated class files to the server, into the same directory as the jar, and splice them in:
jar uvf sparktest-1.0-SNAPSHOT-jar-with-dependencies.jar App*.class
Note: if the class file is not at the jar's root, recreate the same directory structure next to the jar first, then update, e.g.
jar uvf sparktest-1.0-SNAPSHOT-jar-with-dependencies.jar com/example/App*.class
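Putting the steps together, one full update round-trip might look like this (a sketch; user@server, the ~/deploy path, and the com.example.App main class are all illustrative):

# local machine: recompile, then ship only the changed class files
mvn compile
ssh user@server "mkdir -p ~/deploy/com/example"
scp target/classes/com/example/App*.class user@server:~/deploy/com/example/

# server: refresh the entries inside the fat jar, then rerun
cd ~/deploy
jar uvf sparktest-1.0-SNAPSHOT-jar-with-dependencies.jar com/example/App*.class
spark-submit --class com.example.App sparktest-1.0-SNAPSHOT-jar-with-dependencies.jar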
Approach 2: upload dependencies and the project separately, then update only the project jar
POM configuration (Maven)
<dependencies>
    <dependency>
        <groupId>org.scala-lang</groupId>
        <artifactId>scala-library</artifactId>
        <version>${scala.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_2.11</artifactId>
        <version>${spark.version}</version>
    </dependency>
    ......
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.4</version>
        <scope>test</scope>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-dependency-plugin</artifactId>
            <executions>
                <execution>
                    <id>copy-dependencies</id>
                    <phase>package</phase>
                    <goals>
                        <goal>copy-dependencies</goal>
                    </goals>
                    <configuration>
                        <outputDirectory>target/lib</outputDirectory>
                        <excludeTransitive>false</excludeTransitive>
                        <stripVersion>true</stripVersion>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <!-- Scala compile/package plugin -->
        <plugin>
            <groupId>net.alchim31.maven</groupId>
            <artifactId>scala-maven-plugin</artifactId>
            <version>3.3.1</version>
            <executions>
                <execution>
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                    <configuration>
                        <args>
                            <arg>-dependencyfile</arg>
                            <arg>${project.build.directory}/.scala_dependencies</arg>
                        </args>
                    </configuration>
                </execution>
            </executions>
        </plugin>
        <!-- jar plugin: packages the project's own code only, without dependencies -->
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <configuration>
                <archive>
                    <manifest>
                        <!-- <addClasspath>true</addClasspath> -->
                        <mainClass>com.oidd.App</mainClass>
                    </manifest>
                </archive>
            </configuration>
        </plugin>
    </plugins>
</build>
After packaging, a standalone project jar and a lib directory appear:
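Because stripVersion is set to true, the copied dependency jars lose their version suffixes, so target/ ends up looking roughly like this (contents illustrative):

target/
├── sparktest-1.0-SNAPSHOT.jar    # project code only, small
└── lib/
    ├── scala-library.jar
    ├── spark-core_2.11.jar
    └── ...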

Upload the jar and the lib folder to the server; subsequent updates only require replacing the project jar. When running spark-submit, just pass the dependency jars with --jars (which takes a comma-separated list of the jars under lib).
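For example (a sketch; the main class com.oidd.App comes from the POM above, and the jar name follows the earlier example):

# --jars takes a comma-separated list, so join everything under lib/
spark-submit \
    --class com.oidd.App \
    --jars $(echo lib/*.jar | tr ' ' ',') \
    sparktest-1.0-SNAPSHOT.jar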