
Spark Tuning (III): persistence reduces secondary queries

2022-07-07 16:28:00 InfoQ

Hello everyone, I'm Huai Jinyu, a big data newcomer with two little treasure-eaters at home, Jia and Jia — an all-round dad who writes code at work and teaches at home.
If you like my article, feel free to [follow] + [like] + [comment]. Your support is my motivation, and I look forward to growing with you~


1. Background

When we ingest data, we usually need to run ETL on it, but it is also best to keep the raw data in the warehouse. That means the same piece of data gets used twice.

Spark's default behavior when running multiple operators on one RDD is this: every time you execute an operator on the RDD, Spark recomputes that RDD all the way from the source, and only then applies your operator. The performance of this approach is very poor.

So if you first save the raw data and then filter it again, you will find the data being reloaded from the source all over again, which wastes time.
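The recompute-from-source behavior can be mimicked in plain Python. The sketch below is only an analogy, not Spark code (the names `load_source` and `etl_query` are invented for illustration): a lazy pipeline whose expensive load step runs once per query unless its result is kept around.

```python
# Plain-Python analogy of uncached RDD evaluation (not actual Spark code):
# every "action" walks back to the source, so the expensive load runs again.

load_calls = 0

def load_source():
    """Stand-in for reading the raw data from the warehouse."""
    global load_calls
    load_calls += 1
    return list(range(10))

def etl_query(predicate):
    """Each query recomputes from the source, like an uncached RDD."""
    return [x for x in load_source() if predicate(x)]

evens = etl_query(lambda x: x % 2 == 0)
bigs = etl_query(lambda x: x > 5)
print(load_calls)  # 2 -- the source was loaded once per query

# With a cached ("persisted") copy, the source is loaded only once.
load_calls = 0
cached = load_source()  # analogous to persisting the RDD after the first load
evens = [x for x in cached if x % 2 == 0]
bigs = [x for x in cached if x > 5]
print(load_calls)  # 1
```

The second half is exactly what persistence buys you: pay for the load once, then serve every later query from the cached copy.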

2. Optimization

Persist any RDD that is used more than once. Spark will then, according to your persistence strategy, save the RDD's data to memory or disk. Every subsequent operator on that RDD reads the persisted data directly from memory or disk and then runs the operator, instead of recomputing the RDD from the source first.

To persist an RDD, simply call cache() or persist() on it.

cache(): tries to persist all of the RDD's data in memory in deserialized (non-serialized) form.

persist(): lets you choose the persistence level manually and persist the data in the specified way.

df = spark.sql(sql)                     # `spark` is the SparkSession
df = df.persist()                       # persist() returns the DataFrame itself
df.createOrReplaceTempView(temp_table_name)
subdf = spark.sql(f"SELECT * FROM {temp_table_name}")

With this in place, the underlying data will not be reloaded on the second query.

For persist(), we can choose different persistence levels to suit different business scenarios.

How to choose the most appropriate persistence strategy

  • By default, MEMORY_ONLY offers the highest performance, but only if your memory is large enough to hold the entire RDD. Because there is no serialization or deserialization, that overhead is avoided; all subsequent operators on the RDD work on pure in-memory data, with no disk reads, and no copies of the data need to be replicated and transferred to other nodes. Note, however, that in real production environments the scenarios where this strategy can be used directly are limited: if the RDD holds a lot of data (say, billions of records), using this level directly can trigger a JVM OOM (out-of-memory) exception.
  • If memory overflows at the MEMORY_ONLY level, try MEMORY_ONLY_SER. This level serializes the RDD's data before storing it in memory, so each partition becomes just one byte array, which greatly reduces the number of objects and the memory footprint. The extra cost compared with MEMORY_ONLY is serialization and deserialization, but subsequent operators still run on pure in-memory data, so overall performance remains high. As above, if the RDD contains too much data, OOM is still possible.
  • If neither pure-memory level works, use MEMORY_AND_DISK_SER rather than MEMORY_AND_DISK. Reaching this step means the RDD holds a lot of data and memory alone cannot fit it. Serialized data is smaller, saving both memory and disk space. This strategy still caches as much data in memory as it can and writes to disk only what does not fit.
  • DISK_ONLY and the levels with the _2 suffix are generally not recommended. Reading and writing data purely through disk files causes a dramatic performance drop; sometimes recomputing the whole RDD is actually faster. The _2 levels replicate every piece of data and send the copy to another node; the replication and network transfer add large overhead, so avoid them unless the job requires high availability.
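The decision rules in the bullets above can be condensed into a small helper. This is an illustrative pure-Python sketch — `choose_storage_level` and its parameters are invented for this article, not part of Spark's API; in real code you would pass the chosen level to persist() via pyspark.StorageLevel.

```python
# Hypothetical decision helper condensing the recommendations above.
# choose_storage_level is NOT a Spark API; it just encodes the rules of thumb.

def choose_storage_level(fits_in_memory_raw: bool,
                         fits_in_memory_serialized: bool,
                         needs_high_availability: bool = False) -> str:
    """Return a StorageLevel name following the recommendations above."""
    if fits_in_memory_raw:
        level = "MEMORY_ONLY"          # fastest: no (de)serialization cost
    elif fits_in_memory_serialized:
        level = "MEMORY_ONLY_SER"      # serialized, far smaller footprint
    else:
        level = "MEMORY_AND_DISK_SER"  # spill serialized data to disk
    # Replicated (_2) levels only when the job must survive executor loss.
    return level + "_2" if needs_high_availability else level

print(choose_storage_level(True, True))         # MEMORY_ONLY
print(choose_storage_level(False, True))        # MEMORY_ONLY_SER
print(choose_storage_level(False, False))       # MEMORY_AND_DISK_SER
print(choose_storage_level(False, False, True)) # MEMORY_AND_DISK_SER_2
```

DISK_ONLY is deliberately never returned, matching the last bullet: if even MEMORY_AND_DISK_SER struggles, recomputation may well be cheaper than pure disk I/O.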


Conclusion

You can follow the official account [Huai Jin Holds Yu's Jia and Jia] to get resource downloads.

Copyright notice
This article was created by [InfoQ]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/188/202207071413162754.html