当前位置：网站首页>Spark Learning: how to choose different association forms and mechanisms?

Spark Learning: how to choose different association forms and mechanisms?

2022-07-24 09:35:00 【I love evening primrose a】

Spark Sql in join There are many kinds of , Here, it is divided into correlation form and time mechanism
One 、 Data preparation
eg:

import spark.implicits._
import org.apache.spark.sql.DataFrame
// Create an employee table 
val seq = Seq((1,"li",20,"Male"),(2,"shi",22,"Female"),(3,"ming",24,"Female"))
val employees:DataFrame = seq.toDF("id","name","age","gender")
// Create salary table 
val seq2 = Seq((1,20000),(2,26000),(3,24000),(4,30000))
val salaries:DataFrame = seq2.toDF("id","salary")

Two 、 Association form

Association form	keyword
Internal connection	inner
Left outer connection	left/leftouter/left_outer
Right external connection	right/rightouter/right_outer
All external connections	outer/full/fullouter/full_outer
Left half Association	leftsemi/left_semi
Left inverse association	leftanti/left_anti

Internal connection 、 Left outer connection 、 Right external connection 、 External connection and sql The same in grammar , I won't elaborate here
1、 Left half Association :
The result set of left semi association is actually a subset of inner Association , It will only keep those data records that meet the association conditions in the left table
eg:

val leftsemijoinDF:DataFrame = salaries.join(employees,salaries("id") === employees("id"),"leftsemi")
leftsemijoinDF.show

As shown in the figure ,salaries There are three pieces of data and employees It matches , The left outer Association will only retain the data matching the left table and the right table
2、 Left inverse association
After understanding the left half Correlation , Left inverse correlation is also easy to understand , Similarly, only the left data is retained , But only keep the unmatched data
eg:

val leftantijoinDF:DataFrame = salaries.join(employees,salaries("id") === employees("id"),"leftanti")
leftantijoinDF.show

3、 ... and 、 Implementation mechanism
1、NLJ(Nested Loop Join) Nested loop connection

For the two tables involved in association , Such as salaries and employees, In the order in which the codes appear , Generally put salaries Called drive table ,employees Called the base table , stay NLJ Under the implementation mechanism of , The algorithm can be used outside 、 Two nested for loop , To scan the data records of the drive table and the base table in turn , While scanning , It also determines whether the correlation condition is true , As the example above shows , On the outside floor for Cycle first scan the drive table , Scan to id by 1 One hour of , The nested inner loop is also scanning , When the internal circulation is also scanned 1 when , Will match the output , And so on , Judge the recorded in turn id Whether the field meets the conditions

Suppose the drive table has m Bar record , The base table has n Bar record , that NLJ The computational complexity of the algorithm is O(mn), Poor execution efficiency *
2、SMJ(Sort Merge Join) Sort merge connections

SMJ Will sort the two tables first , Then use a separate cursor , Merge and associate the two sorted tables The specific calculation process :
At first , The cursors of the drive table and the base table will be on their first record , Then, by comparing the record where the cursor is located id value , To determine the next step
a、 Satisfy the correlation conditions , On both sides id The values are equal , At this time, the data records on both sides are spliced and output , Then slide the cursor of the drive table to the next record
b、 The association condition is not satisfied , The driver table id The value is less than the base table id value , Now slide the cursor of the drive table to the next record
c、 The association condition is not satisfied , The driver table id The value is greater than the base table id value , Now slide the cursor of the base table to the next record

therefore ,SMJ The computational complexity of the algorithm is O(m+n), But the computational complexity is reduced , It depends on arranging the order first , This is a time-consuming operation , You can sort by indexing
3、HJ(Hash Join) Hash join ( Space for time )
HJ The calculation is divided into two stages , Namely Bulid Phase and Probe Stage

a、 stay Bulid Stage , Above the base table , The algorithm uses the established hash function to build a hash table , In the hash table key yes id Field applies the hash value of the hash function , And hash table value It also includes the original Join
Key and Payload, As shown in the figure below

Insert picture description here

b、 stay Probe Stage , The algorithm traverses each data record of the driving table in turn , First use the same Hash function , Calculate in a dynamic manner Join
Key Hash value of , then , The algorithm then uses the hash value to query just in Bulid Hash table created in phase , If the query fails , This indicates that there is no association between this record and the data in the base table ; contrary , If the query is successful , Then continue to compare the two sides Join
Key, If Join Key Agreement , Just splice the records on both sides for output , So as to carry out data association

Four 、NLJ/SMJ/HJ The advantages and disadvantages of
1、hash join The most efficient execution , however , stay Probe Stage before enjoying performance bonus ,Bulid Stage I have to build a hash table in memory , therefore hash join High memory requirements , It is suitable for computing scenarios where the memory can hold base table data
2、sort merge join There is no memory limitation , Disk can be used for sorting or merging , The applicable scenario is participation join The table of is an ordered table
3、nested loop join Poor efficiency , But it can deal with unequal Correlation , and hash join and sort merge join Can only handle equivalent connections
5、 ... and 、 Form of association in distributed environment ?？？
Form of association in distributed environment

原网站

版权声明
本文为[I love evening primrose a]所创，转载请带上原文链接，感谢
https://yzsam.com/2022/204/202207221524152174.html

当前位置：网站首页>Spark Learning: how to choose different association forms and mechanisms?

Spark Learning: how to choose different association forms and mechanisms?

边栏推荐

猜你喜欢

随机推荐