Spark Tuning (III): persistence reduces secondary queries
2022-07-07 16:28:00 【InfoQ】
1. Cause
2. Optimization
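# `sc` is assumed to be a SparkSession; `sql` and `temp_table_name` are strings defined elsewhere
df = sc.sql(sql)
# Persist the result so that later queries reuse it instead of re-running `sql`
df1 = df.persist()
df1.createOrReplaceTempView(temp_table_name)
# The query text must be a string; the view name is interpolated into it
subdf = sc.sql(f"select * from {temp_table_name}")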
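
The persist-then-register pattern can be wrapped in a small helper. This is only a sketch: the function name `persist_as_view` and its parameters are hypothetical, and `spark` is assumed to be a SparkSession (the object the snippet calls `sc`). It relies only on `sql`, `persist` and `createOrReplaceTempView`, which the snippet already uses.

```python
def persist_as_view(spark, query, view_name, storage_level=None):
    """Run `query`, persist the result, and register it as a temp view.

    `spark` is assumed to be a SparkSession. `storage_level` is an optional
    pyspark.StorageLevel; when it is omitted, DataFrame.persist() falls back
    to its own default level.
    """
    df = spark.sql(query)
    # persist() only marks the DataFrame for caching; nothing is computed yet
    if storage_level is not None:
        cached = df.persist(storage_level)
    else:
        cached = df.persist()
    # Later queries against `view_name` read from the persisted DataFrame
    cached.createOrReplaceTempView(view_name)
    return cached
```

A call might look like `cached = persist_as_view(sc, sql, temp_table_name)`. Persistence is lazy, so the data is materialized by the first action that touches it, and `cached.unpersist()` releases it once the secondary queries are done.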

- By default, MEMORY_ONLY naturally gives the best performance, but only if memory is ample enough to hold all of the RDD's data. No serialization or deserialization takes place, so that overhead is avoided; subsequent operators on the RDD work entirely on in-memory data and never read from disk files; and no replica of the data has to be made and shipped to other nodes. Note, however, that in real production environments the scenarios where this level can be used directly are probably limited: if the RDD holds a lot of data (for example, billions of records), using this level directly can cause a JVM out-of-memory (OOM) error.
- If MEMORY_ONLY causes out-of-memory errors, try MEMORY_ONLY_SER. This level serializes the RDD's data before storing it in memory, so each partition is just a byte array, which greatly reduces the number of objects and lowers memory consumption. The extra cost over MEMORY_ONLY is mainly serialization and deserialization. Subsequent operators still work purely in memory, so overall performance remains fairly high. The same risk applies as above: if the RDD holds too much data, an OOM error can still occur.
- If neither pure-memory level is usable, prefer MEMORY_AND_DISK_SER over MEMORY_AND_DISK. Having reached this step means the RDD is large and does not fit entirely in memory. Serialized data is smaller, which saves both memory and disk space. This level still caches as much data as possible in memory and writes to disk only what does not fit.
- DISK_ONLY and the levels with the _2 suffix are generally not recommended. DISK_ONLY reads and writes all data through disk files, which degrades performance sharply; sometimes recomputing the whole RDD is faster. The _2 levels replicate all data and send the replica to other nodes; the replication and network transfer add significant overhead, so avoid them unless the job requires high availability.
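
The selection order in the list above can be sketched as a small decision helper. Everything here is an illustrative assumption: the function name, the size estimates the caller must supply, and the `serialized_ratio` default are hypothetical, and the function returns only the level name as a string. The names follow the Scala/Java `StorageLevel` constants; PySpark always stores Python objects in serialized form, and recent PySpark versions expose no separate `_SER` constants.

```python
def choose_storage_level(data_bytes, memory_bytes,
                         serialized_ratio=0.5, need_replica=False):
    """Return the name of a storage level, following the order in the list.

    All inputs are hypothetical estimates supplied by the caller:
      data_bytes       -- size of the data when stored deserialized
      memory_bytes     -- memory available for caching
      serialized_ratio -- assumed serialized size relative to data_bytes
      need_replica     -- True only if the job requires high availability
    """
    if data_bytes <= memory_bytes:
        # Everything fits deserialized: no serialization overhead at all
        level = "MEMORY_ONLY"
    elif data_bytes * serialized_ratio <= memory_bytes:
        # Fits only once serialized: pay CPU cost to save memory
        level = "MEMORY_ONLY_SER"
    else:
        # Does not fit in memory even when serialized: spill the rest to disk
        level = "MEMORY_AND_DISK_SER"
    # Replicated (_2) levels add copy and network cost, so they are opt-in
    return level + "_2" if need_replica else level
```

In practice the estimates are rough, so the usual approach is still to start with the fastest level and step down only after observing memory pressure or OOM errors.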
Conclusion