Spark stage and shuffle for daily data processing
2022-06-24 07:27:00 【Something new】
Spark Stage, DAG(Directed Acyclic Graph)
- Spark divides Stages based on the DAG generated for a Job.
- DAG: we learned the concept of a directed acyclic graph (Directed Acyclic Graph) in discrete mathematics. In our production environment, the jobs I write are only at the level of directed trees (Directed tree); I have not yet run into a true directed acyclic graph. But you can imagine that code using RDD join operators could well produce a genuinely acyclic DAG. For the log processing our team does, the logical topology is mostly of directed-tree complexity.
- PS: a directed tree is always a directed acyclic graph, but a directed acyclic graph is not always a directed tree; you can picture the difference yourself. Abstracting a process into a topology makes it much easier to layer optimizations on top of it, instead of writing the result of every step back to storage the way Hadoop MapReduce does, which wastes a great deal of I/O.
- Our business scenario is like this: the raw collected logs are cut into small fields and put in order, an operation I call normalization, and then a series of operations is performed on the normalized data.
real_data.map(deal_data_func).reduceByKey(merge_data_func).foreachRDD(store_data_func)
- Inside store_data_func, foreachPartition is used to connect to the storage medium. In Spark, this kind of method is called an action.
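As a concrete illustration of that pattern (a minimal sketch, not our project's real code; connect_to_storage and its put call are hypothetical placeholders for whatever storage medium is used):

```python
def store_data_func(rdd):
    def store_partition(records):
        # One connection per partition instead of one per record.
        conn = connect_to_storage()          # hypothetical helper for the storage medium
        try:
            for key, value in records:
                conn.put(key, value)         # hypothetical write call
        finally:
            conn.close()

    # foreachPartition is an action: calling it is what actually runs the job.
    rdd.foreachPartition(store_partition)
```

Opening one connection per partition, rather than one per record or a single one on the driver, is the usual reason foreachPartition is chosen at this step.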
RDD Methods
- RDD methods come in two kinds: transformation and action. Only when an action is called does Spark actually submit the job to the DAGScheduler, which then assigns tasks to the TaskScheduler.
- If your Spark project only performs transformations and never submits an action, Spark will do nothing at all! Code like real_data.map(deal_data_func).reduceByKey(merge_data_func) on its own is not rare in Spark projects and can even look complete. This is one of the biggest differences from MapReduce: MapReduce has no notion of Stage division, and many people who learned from old MapReduce-style code online fall into exactly this misunderstanding when they start with Spark. A toy demonstration of this laziness follows below.
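The following toy PySpark snippet (not the project code) shows that transformations alone only build the lineage; the job is submitted only once an action such as collect() is called:

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "lazy-demo")

logs = sc.parallelize(["a 1", "b 2", "a 3"])

# Transformations only: Spark records the lineage, nothing executes yet.
pairs = logs.map(lambda line: (line.split()[0], int(line.split()[1])))
summed = pairs.reduceByKey(lambda x, y: x + y)

# The action is what actually submits the job to the DAGScheduler.
print(summed.collect())   # [('a', 4), ('b', 2)] (order may vary)
```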
- The reason Spark performs the actual computation only after an action is submitted is to make full use of dividing the DAG into Stages, whose advantages include, but are not limited to, reduced computation and reduced I/O load.
- Among the many transformation operations, as mentioned in the previous article, there are two categories: wide dependencies (reduceByKey, ...) and narrow dependencies (map, flatMap, ...).
- The latter is much simpler than the former: it just applies a mapping to each record within every Partition, and the number of Partitions does not change.
- The former is a little more complicated, because in this type of operation the goal is an extraction over the global data (for example, adding up the values of the same key); when the amount of data is too large to fit on one machine, we need Spark to schedule, split, and redistribute the Partitions and the data.
- The Partition count of the new RDD produced by a wide dependency is the biggest puzzle and black box for beginners (myself included). One day I finally could not resist and went through the source code, taking reduceByKey as an example:

        // reduceByKey has three overloads; the signatures are clear at a glance
        1. def reduceByKey(partitioner: Partitioner, func: JFunction2[V, V, V])
        2. def reduceByKey(func: JFunction2[V, V, V], numPartitions: Int)
        3. def reduceByKey(func: JFunction2[V, V, V])

- What we use most is the shortest one, overload (3) with a single parameter. Overload (2) has one extra parameter, numPartitions, which means we can specify the Partition count ourselves, so the common claim online that the Partition count is "generated by Spark itself" is somewhat misleading; still, this overload is only useful once you understand how Spark schedules. Overload (1) is the focus this time: its first parameter is a value of type Partitioner. We can guess that when we call (3), specifying neither numPartitions nor a Partitioner, there must be some default that determines the Partition count after reduceByKey.
- Keep digging through the source: in the implementation of (3) we see a defaultPartitioner being instantiated and passed to (1):

        fromRDD(reduceByKey(defaultPartitioner(rdd), func))

        // Signature: it can be seen that at least one RDD must be passed in
        def defaultPartitioner(rdd: RDD[_], others: RDD[_]*)
- defaultPartitioner is a built-in Spark implementation; it holds the logic that determines the new RDD's Partitions:
  - if spark.default.parallelism is not set, the new RDD's Partition count is the largest Partition count among the RDDs passed in (one or more);
  - if spark.default.parallelism is set, the new RDD's Partition count is the value of that parameter.

        val defaultNumPartitions = if (rdd.context.conf.contains("spark.default.parallelism")) {
          rdd.context.defaultParallelism
        } else {
          rdds.map(_.partitions.length).max
        }
- The spark.default.parallelism parameter is set when the Spark application is initialized and is stored in the SparkContext; it is familiar to anyone who uses Spark. In general this value is set to Executor count * cores per Executor * 2.
- Now we understand where the Partition count comes from. For the more detailed source-level behaviour, you can read the Partitioner.scala file in Spark Core; it is very concise. A small runnable check of the rule is sketched below.
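A minimal PySpark check of that rule (a sketch; it assumes the Spark 2.x behaviour quoted from the source above, and that no upstream RDD already carries a partitioner):

```python
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[4]").setAppName("partition-count-demo")
conf.set("spark.default.parallelism", "6")       # explicitly set, so it wins
sc = SparkContext(conf=conf)

a = sc.parallelize(range(100), 8).map(lambda x: (x % 10, 1))

b = a.reduceByKey(lambda x, y: x + y)            # no numPartitions, no Partitioner
print(b.getNumPartitions())                      # 6: taken from spark.default.parallelism

c = a.reduceByKey(lambda x, y: x + y, 3)         # numPartitions specified explicitly
print(c.getNumPartitions())                      # 3
```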
- The number of Partitions in an RDD is important because it determines, to a large extent, Spark's degree of concurrency. As mentioned in the previous article, an RDD's Partitions correspond one-to-one with the tasks of the Stage it belongs to; this is also where the name of the spark.default.parallelism parameter comes from.
- In Spark's patches, the choice of Partition count has always been a hot topic; if you are interested, take a look at this patch (https://issues.apache.org/jira/plugins/servlet/mobile#issue/SPARK-6377). Up to Spark 2.3.2 the conclusion above still holds, but Spark SQL already contains code for dynamically adjusting the Partition count, behind spark.sql.adaptive.enabled=true.
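For reference, this is roughly how that switch is turned on (a sketch; spark.sql.adaptive.enabled only affects Spark SQL / DataFrame shuffles, not plain RDD jobs, and its exact behaviour varies by version):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("aqe-demo")
         # Let Spark SQL adjust shuffle partition counts at runtime.
         .config("spark.sql.adaptive.enabled", "true")
         .getOrCreate())
```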
Stage Division
- Actually Stage division should be the easiest part to understand, and you do not need to dig into it at the source level. In practice, what we need to pay attention to most is when a Shuffle will happen, and Stage division is precisely about finding out where a Shuffle must occur. A Shuffle means that data may migrate between different nodes, be written from memory to files, and be read back from files into memory, a series of costly operations; in more than 90% of scenarios, the less Shuffle the better.
- One way to achieve this is by choosing different transformations. The most classic is replacing groupByKey with reduceByKey, which I will not belabor; the principle is that reduceByKey aggregates the local data first and only then transfers it to other nodes, reducing the Shuffle volume (a minimal comparison is sketched below).
- A Stage in Spark is essentially a set of tasks that can execute in parallel, and the criterion for dividing Stages is the appearance of a wide dependency. Take the example at the beginning of the article as the prototype.
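A minimal comparison of the two (toy data, reusing the sc from the earlier sketch): both compute the same per-key sums, but reduceByKey combines values inside each partition before the Shuffle, so far less data crosses the network.

```python
pairs = sc.parallelize([("a", 1), ("a", 2), ("b", 3)] * 1000, 4)

# groupByKey shuffles every single (key, value) record, then sums on the other side.
sums_grouped = pairs.groupByKey().mapValues(sum)

# reduceByKey pre-aggregates within each partition, then shuffles one value per key per partition.
sums_reduced = pairs.reduceByKey(lambda x, y: x + y)

assert sorted(sums_grouped.collect()) == sorted(sums_reduced.collect())
```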
- As you can see from the diagram, the Shuffle starts when reduceByKey is executed. Assuming your Spark deployment is a cluster with many nodes:
  - First, reduceByKey is performed locally, producing <key, value> pairs that are unique only within that node.
  - Then comes Shuffle-Write (e.g. writing to disk); the DAGScheduler selects which data is allocated to which node (decided by the default partitioner).
  - Then the destination node performs Shuffle-Read (e.g. reading over the network) and actively pulls the data.
  - Finally the data is merged; at this point, any key on any node is globally unique.
- As can be seen from the above, to reduce the cost of Shuffle, besides reducing the number of Shuffles you should also try to reduce the amount of data moved in each Shuffle; one small example is sketched below.
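One way to shrink what a Shuffle has to move (a sketch with a hypothetical log layout and a hypothetical raw_logs RDD of text lines): drop unneeded records and keep only the fields you aggregate on before the wide dependency, so only the trimmed pairs are shuffled.

```python
# Hypothetical log layout: "timestamp level user bytes ..."
def to_pair(line):
    fields = line.split()
    return (fields[2], int(fields[3]))               # keep only the two fields we aggregate on

slim = (raw_logs                                     # hypothetical input RDD of raw log lines
        .filter(lambda line: " DEBUG " not in line)  # discard records early, before the shuffle
        .map(to_pair))                               # narrow operations: no shuffle yet

per_user_bytes = slim.reduceByKey(lambda x, y: x + y)  # the shuffle now moves only (user, bytes) pairs
```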
- After the Shuffle, our project scenarios generally need to store the computed results. To a certain extent, how the results are stored determines whether this batch of tasks can truly be considered finished; it can roughly be divided into in-place storage and centralized storage, which will be covered in detail in the next chapter.
[Extra] A brief look at Shuffle Read & Write
These details are not actually very helpful for real project work; they are just for getting to know Spark internals. All you really need to know is that the various kinds of I/O brought by Shuffle are unavoidable, and that Spark keeps adding optimization algorithms to reduce this part of the cost. To set the scene, assume we use the default storage medium, so Shuffle Write means writing data to the local disk.
- When a record has gone through the various normalization steps and the last narrow-dependency transformation has been called (still taking the example above as the background), the first thing that happens is Shuffle Write. Spark first confirms the Partitioner to be used this time; as seen above, a partitioner has two functions:
  - determine the number of Partitions of the new RDD;
  - decide which Partition each piece of data is placed in.
- Once Spark has determined the number of Partitions:
  - First it uses its internal algorithm to perform reduceByKey on the local data.
  - Then it creates new temporary files locally; depending on the circumstances (for example the Partition count, serialization, and so on), a different Shuffle Write algorithm is chosen to write the intermediate results out to disk.
  - According to the Partitioner, it decides which Partition each key's data belongs to and sorts by partition id in memory; when memory runs out, it spills to disk together with an index file that identifies the data of the different partitions (this file is ordered). A toy version of this key-to-partition mapping is sketched after this list.
  - Finally, when the other end is ready to pull the data, the data belonging to the same partition but scattered across different files is merged and handed over to the other end.
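To make the key-to-partition step concrete, the default HashPartitioner essentially does the following (a simplified Python sketch of the idea, not Spark's actual source, which works on the JVM hashCode):

```python
def partition_for_key(key, num_partitions):
    # Non-negative modulo of the key's hash, as in Spark's HashPartitioner.
    return (hash(key) % num_partitions + num_partitions) % num_partitions

print(partition_for_key("user_42", 8))   # some index in [0, 8)
```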
- In the figure, the tasks at (1) correspond one-to-one with the Partitions of the input RDD, and a merge is done at Stage (3). The tasks of Stage (4) are the Shuffle Read tasks on the remote end; their number is the same as, and corresponds one-to-one with, the Partitions of the new RDD.
There are far too many details to go into here, because Shuffle Write has many algorithms, and Spark chooses which one to use for writing out the files, according to the situation, in order to reduce the performance cost. The scenario described above is just one of them, SortShuffle.