Stateful streams in Spark Streaming: mapWithState
2022-07-26 16:06:00 【InfoQ】
Background

Spark Streaming offers two operators for maintaining state across batches: updateStateByKey and mapWithState. This article walks through the workflow of mapWithState.

Compared with updateStateByKey, which touches the state of every key in every batch, mapWithState only processes the keys that receive new data in the current batch (plus timed-out keys), so it performs better when the state is large.

The classes behind mapWithState are MapWithStateDStreamImpl, InternalMapWithStateDStream, MapWithStateRDD and MapWithStateRDDRecord; the analysis below traces a call to mapWithState through these classes and their relationships.
Source code analysis

Start from a complete WordCount example built on mapWithState:
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, State, StateSpec, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val sparkConf = new SparkConf().setAppName("WordCount").setMaster("local")
val ssc = new StreamingContext(sparkConf, Seconds(10))
ssc.checkpoint("d:\\tmp")
val params = Map("bootstrap.servers" -> "master:9092", "group.id" -> "scala-stream-group")
val topic = Set("test")
val initialRDD = ssc.sparkContext.parallelize(List[(String, Int)]())
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, params, topic)
val word = messages.flatMap(_._2.split(" ")).map { x => (x, 1) }
// Custom mappingFunc: add this batch's count for the word to its previous count and update the state
val mappingFunc = (word: String, count: Option[Int], state: State[Int]) => {
  val sum = count.getOrElse(0) + state.getOption.getOrElse(0)
  val output = (word, sum)
  state.update(sum)
  output
}
// Call mapWithState to maintain per-key state across batches
val stateDstream = word.mapWithState(StateSpec.function(mappingFunc).initialState(initialRDD))
stateDstream.print()
ssc.start()
ssc.awaitTermination()
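StateSpec has more knobs than the two used above. As a sketch of the commonly used ones (the values here are arbitrary), the spec can also pin the number of partitions and set an idle timeout, which matters for the timeout branch analyzed later:

val spec = StateSpec.function(mappingFunc)
  .initialState(initialRDD)
  .numPartitions(2)     // fixes the partitioner of the state RDD
  .timeout(Seconds(60)) // keys idle for 60s are passed to mappingFunc one last time
word.mapWithState(spec).print()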
mapWithState is not defined on DStream itself but in PairDStreamFunctions; a DStream[(K, V)] acquires it through the implicit conversion toPairDStreamFunctions:

implicit def toPairDStreamFunctions[K, V](stream: DStream[(K, V)])
    (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null):
  PairDStreamFunctions[K, V] = {
  new PairDStreamFunctions[K, V](stream)
}
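This is Scala's standard enrich-my-library pattern. A minimal self-contained sketch of the same mechanism, with illustrative names (not Spark's):

import scala.language.implicitConversions

// Seq[(K, V)] gains an extra method, just as DStream[(K, V)] gains mapWithState
class PairSeqFunctions[K, V](seq: Seq[(K, V)]) {
  def keysOnly: Seq[K] = seq.map(_._1)
}
implicit def toPairSeqFunctions[K, V](seq: Seq[(K, V)]): PairSeqFunctions[K, V] =
  new PairSeqFunctions(seq)

Seq("a" -> 1, "b" -> 2).keysOnly // List(a, b): the compiler inserts the conversion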
PairDStreamFunctions.mapWithState() is marked @Experimental; it simply wraps the source DStream and the StateSpec into a MapWithStateDStreamImpl:

@Experimental
def mapWithState[StateType: ClassTag, MappedType: ClassTag](
    spec: StateSpec[K, V, StateType, MappedType]
  ): MapWithStateDStream[K, V, StateType, MappedType] = {
  new MapWithStateDStreamImpl[K, V, StateType, MappedType](
    self, // self is the DStream that mapWithState is called on
    spec.asInstanceOf[StateSpecImpl[K, V, StateType, MappedType]]
  )
}
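One detail worth noting: the internal mapping function seen later in updateRecordWithData has the shape (Time, K, Option[V], State[S]) => Option[E], while the WordCount mappingFunc only takes three arguments. StateSpec.function bridges the gap by wrapping the user function; a simplified sketch of that wrapping (not the exact Spark code):

import org.apache.spark.streaming.{State, Time}

// Lift a 3-argument user function into the internal 4-argument form
def wrapUserFunction[K, V, S, E](
    f: (K, Option[V], State[S]) => E): (Time, K, Option[V], State[S]) => Option[E] =
  (_: Time, k: K, v: Option[V], s: State[S]) => Some(f(k, v, s))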
The StateSpec passed to mapWithState describes how the state is to be managed: the mapping function, the initial state, the partitioner, the timeout, and so on. Inside MapWithStateDStreamImpl the real work is delegated to an InternalMapWithStateDStream:

private val internalStream =
  new InternalMapWithStateDStream[KeyType, ValueType, StateType, MappedType](dataStream, spec)
......
// This method only extracts the data to be returned; the main logic sits in internalStream's compute
override def compute(validTime: Time): Option[RDD[MappedType]] = {
  internalStream.getOrCompute(validTime).map { _.flatMap[MappedType] { _.mappedData } }
}
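Each element of the internal RDD is a MapWithStateRDDRecord, and its mappedData field carries the values produced by the mapping function; the flatMap above merely flattens them. The same operation on plain collections, with illustrative types:

case class RecordSketch[E](mappedData: Seq[E])

val records = Seq(
  RecordSketch(Seq(("hello", 4))),
  RecordSketch(Seq(("world", 1), ("spark", 2)))
)
records.flatMap(_.mappedData) // List((hello,4), (world,1), (spark,2))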
MapWithStateDStreamImpl.compute calls internalStream.getOrCompute, which is DStream.getOrCompute and eventually invokes InternalMapWithStateDStream.compute (line 132 of the source):
/** Method that generates a RDD for the given time */
override def compute(validTime: Time): Option[RDD[MapWithStateRDDRecord[K, S, E]]] = {
// Get the state RDD produced by the previous batch
val prevStateRDD = getOrCompute(validTime - slideDuration) match {
case Some(rdd) =>
// This rdd is of type RDD[MapWithStateRDDRecord[K, S, E]]; it is actually a MapWithStateRDD
if (rdd.partitioner != Some(partitioner)) {
// If the RDD is not partitioned the right way, let us repartition it using the
// partition index as the key. This is to ensure that state RDD is always partitioned
// before creating another state RDD using it
// _.stateMap.getAll() returns all the state entries (explained below); a new MapWithStateRDD is created from them
MapWithStateRDD.createFromRDD[K, V, S, E](
rdd.flatMap { _.stateMap.getAll() }, partitioner, validTime)
} else {
rdd
}
case None =>
// The first batch of data will enter this branch
MapWithStateRDD.createFromPairRDD[K, V, S, E](
spec.getInitialStateRDD().getOrElse(new EmptyRDD[(K, S)](ssc.sparkContext)),
partitioner,
validTime
)
}
// dataRDD holds the stream data of this batch
val dataRDD = parent.getOrCompute(validTime).getOrElse {
context.sparkContext.emptyRDD[(K, V)]
}
val partitionedDataRDD = dataRDD.partitionBy(partitioner) // Repartition
val timeoutThresholdTime = spec.getTimeoutInterval().map { interval =>
(validTime - interval).milliseconds
}
// Construct the MapWithStateRDD
Some(new MapWithStateRDD(
prevStateRDD, partitionedDataRDD, mappingFunction, validTime, timeoutThresholdTime))
}
}
- If the previous state MapWithStateRDD can be obtained:
- Check whether that RDD is partitioned the right way. The subsequent computation combines, partition by partition, the old state with the newly arrived data; if the two are partitioned differently, records are matched against the wrong state and the wrong state gets updated. If the partitioning is correct the RDD is returned as-is; otherwise MapWithStateRDD.createFromRDD repartitions the data with the specified partitioner, as illustrated after this list.
- If the previous state MapWithStateRDD cannot be obtained:
- One is created through MapWithStateRDD.createFromPairRDD. If the WordCount code set an initialStateRDD on the StateSpec, the MapWithStateRDD is built on top of initialStateRDD; otherwise an empty MapWithStateRDD is created. Note that the same partitioner is used here, again to keep the partitioning consistent.
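Why the insistence on one partitioner? State partition i is combined with data partition i, which is only correct if every key hashes to the same partition index on both sides. A quick illustration with HashPartitioner:

import org.apache.spark.HashPartitioner

val p = new HashPartitioner(4)
val stateIdx = p.getPartition("hello") // partition holding the state of "hello"
val dataIdx  = p.getPartition("hello") // partition receiving new records for "hello"
assert(stateIdx == dataIdx) // guaranteed only because both sides use the same partitioner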
The result is new MapWithStateRDD(prevStateRDD, partitionedDataRDD, mappingFunction, validTime, timeoutThresholdTime). Next, look at MapWithStateRDD.compute, which processes one partition at a time:
override def compute(
partition: Partition, context: TaskContext): Iterator[MapWithStateRDDRecord[K, S, E]] = {
val stateRDDPartition = partition.asInstanceOf[MapWithStateRDDPartition]
val prevStateRDDIterator = prevStateRDD.iterator(
stateRDDPartition.previousSessionRDDPartition, context)
val dataIterator = partitionedDataRDD.iterator(
stateRDDPartition.partitionedDataRDDPartition, context)
// prevRecord is the MapWithStateRDDRecord of the corresponding partition of prevStateRDD
val prevRecord = if (prevStateRDDIterator.hasNext) Some(prevStateRDDIterator.next()) else None
val newRecord = MapWithStateRDDRecord.updateRecordWithData(
prevRecord,
dataIterator,
mappingFunction,
batchTime,
timeoutThresholdTime,
removeTimedoutData = doFullScan // remove timedout data only when full scan is enabled
)
Iterator(newRecord)
}
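Note that compute returns Iterator(newRecord): each partition of a MapWithStateRDD holds exactly one MapWithStateRDDRecord, carrying the state of every key in that partition plus this batch's output. Conceptually (a simplified sketch, not the real class):

// One record per partition: all per-key state plus the mapped output of this batch
case class MapWithStateRDDRecordSketch[K, S, E](
  stateMap: Map[K, S], // state of every key that hashes to this partition
  mappedData: Seq[E]   // values returned by the mapping function in this batch
)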
The core of the state update is MapWithStateRDDRecord.updateRecordWithData:

def updateRecordWithData[K: ClassTag, V: ClassTag, S: ClassTag, E: ClassTag](
prevRecord: Option[MapWithStateRDDRecord[K, S, E]],
dataIterator: Iterator[(K, V)],
mappingFunction: (Time, K, Option[V], State[S]) => Option[E],
batchTime: Time,
timeoutThresholdTime: Option[Long],
removeTimedoutData: Boolean
): MapWithStateRDDRecord[K, S, E] = {
// Copy the historical state from prevRecord (held in the stateMap field of MapWithStateRDDRecord)
val newStateMap = prevRecord.map { _.stateMap.copy() }.getOrElse { new EmptyStateMap[K, S]() }
// Save the data to be returned
val mappedData = new ArrayBuffer[E]
val wrappedState = new StateImpl[S]() // subclass of State, representing the state of one key
// Iterate over all records (key-value pairs) of this batch
dataIterator.foreach { case (key, value) =>
// Look up this key's historical state in the state map
wrappedState.wrap(newStateMap.get(key))
// Apply mappingFunction to each record of this batch
val returned = mappingFunction(batchTime, key, Some(value), wrappedState)
if (wrappedState.isRemoved) {
newStateMap.remove(key)
} else if (wrappedState.isUpdated
|| (wrappedState.exists && timeoutThresholdTime.isDefined)) {
newStateMap.put(key, wrappedState.get(), batchTime.milliseconds)
}
mappedData ++= returned
} // Depending on whether the state was removed or updated, maintain newStateMap accordingly and collect the returned data into mappedData
// If a timeout is configured, timed-out keys get some extra processing
if (removeTimedoutData && timeoutThresholdTime.isDefined) {
newStateMap.getByTime(timeoutThresholdTime.get).foreach { case (key, state, _) =>
wrappedState.wrapTimingOutState(state)
val returned = mappingFunction(batchTime, key, None, wrappedState)
mappedData ++= returned
newStateMap.remove(key)
}
}
// Still returns a MapWithStateRDDRecord; only its contents have changed
MapWithStateRDDRecord(newStateMap, mappedData)
}
# word="hello",value=1,state=("hello",3)
val mappingFunc = (word: String, value: Option[Int], state: State[Int]) => {
val sum = value.getOrElse(0) + state.getOption.getOrElse(0) #sum=4
val output = (word, sum) #("hello",4)
state.update(sum) # to update key="hello" The status of is 4
output # return ("hello",4)
}
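If a timeout was set on the StateSpec, mappingFunc is also invoked for expiring keys, with value = None and state.isTimingOut() == true; calling state.update in that case throws an exception. A sketch of a timeout-aware variant of the WordCount function, under that assumption:

val mappingFuncWithTimeout = (word: String, value: Option[Int], state: State[Int]) => {
  if (state.isTimingOut()) {
    // The key is expiring: value is None here and the state must not be updated
    (word, state.get()) // emit the final count
  } else {
    val sum = value.getOrElse(0) + state.getOption.getOrElse(0)
    state.update(sum)
    (word, sum)
  }
}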
So where does the state itself live? In a StateMap; its companion object shows the default implementation:

/** Companion object for [[StateMap]], with utility methods */
private[streaming] object StateMap {
def empty[K, S]: StateMap[K, S] = new EmptyStateMap[K, S]
def create[K: ClassTag, S: ClassTag](conf: SparkConf): StateMap[K, S] = {
val deltaChainThreshold = conf.getInt("spark.streaming.sessionByKey.deltaChainThreshold",
DELTA_CHAIN_LENGTH_THRESHOLD)
new OpenHashMapBasedStateMap[K, S](deltaChainThreshold)
}
}
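The deltaChainThreshold parameter hints at how OpenHashMapBasedStateMap makes the copy() seen in updateRecordWithData cheap: a copy references its parent map and records only its own changes, and the chain is compacted once it grows past the threshold. A simplified sketch of the idea (not the real implementation):

// Copy-on-write delta chain: copy() is O(1); lookups walk up the chain
class DeltaMapSketch[K, V](parent: Option[DeltaMapSketch[K, V]] = None) {
  private val delta = scala.collection.mutable.Map.empty[K, V]
  def put(k: K, v: V): Unit = delta(k) = v
  def get(k: K): Option[V] = delta.get(k).orElse(parent.flatMap(_.get(k)))
  def copy(): DeltaMapSketch[K, V] = new DeltaMapSketch(Some(this))
}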
OpenHashMapBasedStateMap stores its entries in an org.apache.spark.util.collection.OpenHashMap, Spark's memory-efficient open-addressing hash map.