Spark source code learning - Data Serialization
2022-07-27 00:58:00 【A photographer who can't play is not a good programmer】
The original text
Data Serialization
Serialization plays an important role in the performance of any distributed application. Formats that are slow to serialize objects into, or consume a large number of bytes, will greatly slow down the computation. Often, this will be the first thing you should tune to optimize a Spark application. Spark aims to strike a balance between convenience (allowing you to work with any Java type in your operations) and performance. It provides two serialization libraries:
Translation: Serialization plays an important role in the performance of any distributed application. Serializing objects slowly, or into formats that consume a large number of bytes, will greatly slow down computation. Usually this is the first thing to tune when optimizing a Spark application. Spark aims to balance convenience (allowing you to use any Java type in your operations) with performance, and it provides two serialization libraries:
1. Java serialization:
By default, Spark serializes objects using Java’s ObjectOutputStream framework, and can work with any class you create that implements java.io.Serializable. You can also control the performance of your serialization more closely by extending java.io.Externalizable. Java serialization is flexible but often quite slow, and leads to large serialized formats for many classes.
Translation: Java serialization is the default. Spark serializes objects using Java's ObjectOutputStream framework, and it works with any class you create that implements java.io.Serializable. You can also control serialization performance more closely by extending java.io.Externalizable. Java serialization is flexible, but it is often quite slow and leads to large serialized formats for many classes.
(Summary: Java serialization is Spark's default serialization method; it is very flexible but slow.)
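For illustration, here is a minimal Scala sketch (the class names Point and CompactPoint are hypothetical). A plain case class is already java.io.Serializable, while implementing java.io.Externalizable lets you read and write the fields by hand:

import java.io.{Externalizable, ObjectInput, ObjectOutput}

// A case class is Serializable by default, so Spark's Java serializer
// can handle it without any extra work.
case class Point(x: Double, y: Double)

// Implementing Externalizable gives closer control over the wire format.
class CompactPoint(var x: Double, var y: Double) extends Externalizable {
  def this() = this(0.0, 0.0) // Externalizable requires a no-arg constructor
  override def writeExternal(out: ObjectOutput): Unit = {
    out.writeDouble(x)
    out.writeDouble(y)
  }
  override def readExternal(in: ObjectInput): Unit = {
    x = in.readDouble()
    y = in.readDouble()
  }
}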
2. Kryo serialization:
Spark can also use the Kryo library (version 4) to serialize objects more quickly. Kryo is significantly faster and more compact than Java serialization (often as much as 10x), but does not support all Serializable types and requires you to register the classes you’ll use in the program in advance for best performance.
Translation: Spark can also use the Kryo library to serialize objects more quickly. Kryo is significantly faster and more compact than Java serialization (often by as much as 10x), but it does not support all Serializable types, and it requires you to register the classes you will use in the program in advance to get the best performance.
(Summary: Kryo serialization is faster, but it is less flexible.)
You can switch to using Kryo by initializing your job with a SparkConf and calling conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). This setting configures the serializer used for not only shuffling data between worker nodes but also when serializing RDDs to disk. The only reason Kryo is not the default is because of the custom registration requirement, but we recommend trying it in any network-intensive application. Since Spark 2.0.0, we internally use Kryo serializer when shuffling RDDs with simple types, arrays of simple types, or string type.
Translation: You can switch to Kryo by initializing your job with a SparkConf and calling conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). The serializer configured by this setting is used not only for shuffling data between worker nodes, but also when serializing RDDs to disk. The only reason Kryo is not the default is the custom registration requirement, but we recommend trying it in any network-intensive application. Since Spark 2.0.0, Spark internally uses the Kryo serializer when shuffling RDDs with simple types, arrays of simple types, or the string type.
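Putting the quoted setting into code, a minimal sketch (the master URL and app name are placeholders of my choosing):

import org.apache.spark.SparkConf

// Switch the serializer used for shuffles and on-disk RDD serialization.
val conf = new SparkConf()
  .setMaster("local[*]") // placeholder master URL
  .setAppName("KryoDemo") // placeholder app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")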
To register your own custom classes with Kryo, use the registerKryoClasses method.
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)
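As an optional follow-up, assuming the standard spark.kryo.registrationRequired setting from the Spark configuration docs, you can make Kryo fail fast on unregistered classes instead of silently writing the full class name with every object:

// Throw an error when an unregistered class is serialized, so missing
// registrations are caught early rather than costing space at runtime.
conf.set("spark.kryo.registrationRequired", "true")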
Summary:
1. Spark provides two serialization methods: Java serialization and Kryo serialization.
2. Java serialization is Spark's default serialization method; it is very flexible but slow.
3. Kryo serialization is fast but less flexible; classes must be registered in advance.