Spark source code learning - Data Serialization
2022-07-27 00:58:00 【A photographer who can't play is not a good programmer】
The original text
Data Serialization
Serialization plays an important role in the performance of any distributed application. Formats that are slow to serialize objects into, or consume a large number of bytes, will greatly slow down the computation. Often, this will be the first thing you should tune to optimize a Spark application. Spark aims to strike a balance between convenience (allowing you to work with any Java type in your operations) and performance. It provides two serialization libraries:
Translation: Serialization plays an important role in the performance of any distributed application. Serializing objects slowly, or into formats that consume a large number of bytes, will greatly slow down computation. Usually this is the first thing to tune when optimizing a Spark application. Spark aims to balance convenience (allowing you to use any Java type in your operations) with performance, and it provides two serialization libraries:
1. Java serialization:
By default, Spark serializes objects using Java’s ObjectOutputStream framework, and can work with any class you create that implements java.io.Serializable. You can also control the performance of your serialization more closely by extending java.io.Externalizable. Java serialization is flexible but often quite slow, and leads to large serialized formats for many classes.
Translation: Java serialization is the default. Spark serializes objects using Java's ObjectOutputStream framework, and it works with any class you create that implements java.io.Serializable. You can also control serialization performance more closely by extending java.io.Externalizable. Java serialization is flexible, but it is often quite slow and leads to large serialized formats for many classes.
(Summary: Java serialization is Spark's default serialization method; it is very flexible but slow.)
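For illustration, here is a minimal Scala sketch (the class names Point and CompactPoint are hypothetical). A plain case class is already java.io.Serializable, while implementing java.io.Externalizable lets you read and write the fields by hand:

import java.io.{Externalizable, ObjectInput, ObjectOutput}

// A case class is Serializable by default, so Spark's Java serializer
// can handle it without any extra work.
case class Point(x: Double, y: Double)

// Implementing Externalizable gives closer control over the wire format.
class CompactPoint(var x: Double, var y: Double) extends Externalizable {
  def this() = this(0.0, 0.0) // Externalizable requires a no-arg constructor
  override def writeExternal(out: ObjectOutput): Unit = {
    out.writeDouble(x)
    out.writeDouble(y)
  }
  override def readExternal(in: ObjectInput): Unit = {
    x = in.readDouble()
    y = in.readDouble()
  }
}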
2. Kryo serialization:
Spark can also use the Kryo library (version 4) to serialize objects more quickly. Kryo is significantly faster and more compact than Java serialization (often as much as 10x), but does not support all Serializable types and requires you to register the classes you’ll use in the program in advance for best performance.
Translation: Spark can also use the Kryo library to serialize objects more quickly. Kryo is significantly faster and more compact than Java serialization (often by as much as 10x), but it does not support all Serializable types, and it requires you to register the classes you will use in the program in advance to get the best performance.
(Summary: Kryo serialization is faster, but it is less flexible.)
You can switch to using Kryo by initializing your job with a SparkConf and calling conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). This setting configures the serializer used for not only shuffling data between worker nodes but also when serializing RDDs to disk. The only reason Kryo is not the default is because of the custom registration requirement, but we recommend trying it in any network-intensive application. Since Spark 2.0.0, we internally use Kryo serializer when shuffling RDDs with simple types, arrays of simple types, or string type.
Translation: You can switch to Kryo by initializing your job with a SparkConf and calling conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"). The serializer configured by this setting is used not only for shuffling data between worker nodes, but also when serializing RDDs to disk. The only reason Kryo is not the default is the custom registration requirement, but we recommend trying it in any network-intensive application. Since Spark 2.0.0, Spark internally uses the Kryo serializer when shuffling RDDs with simple types, arrays of simple types, or the string type.
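Putting the quoted setting into code, a minimal sketch (the master URL and app name are placeholders of my choosing):

import org.apache.spark.SparkConf

// Switch the serializer used for shuffles and on-disk RDD serialization.
val conf = new SparkConf()
  .setMaster("local[*]") // placeholder master URL
  .setAppName("KryoDemo") // placeholder app name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")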
To register your own custom classes with Kryo, use the registerKryoClasses method.
val conf = new SparkConf().setMaster(...).setAppName(...)
conf.registerKryoClasses(Array(classOf[MyClass1], classOf[MyClass2]))
val sc = new SparkContext(conf)
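As an optional follow-up, assuming the standard spark.kryo.registrationRequired setting from the Spark configuration docs, you can make Kryo fail fast on unregistered classes instead of silently writing the full class name with every object:

// Throw an error when an unregistered class is serialized, so missing
// registrations are caught early rather than costing space at runtime.
conf.set("spark.kryo.registrationRequired", "true")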
Summary:
1. Spark provides two serialization methods: Java serialization and Kryo serialization.
2. Java serialization is Spark's default serialization method; it is very flexible but slow.
3. Kryo serialization is fast but less flexible; classes must be registered in advance.