Spark Tuning (III): persistence reduces secondary queries
2022-07-07 16:28:00 【InfoQ】
1. Cause
2. Optimization
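# `sc` is assumed to be a SparkSession; `sql` and `temp_table_name` are defined elsewhere.
df = sc.sql(sql)
df1 = df.persist()  # mark the result for caching; it is materialized on the first action
df1.createOrReplaceTempView(temp_table_name)
subdf = sc.sql(f"SELECT * FROM {temp_table_name}")  # later queries read the persisted view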

- MEMORY_ONLY gives the best performance, provided there is more than enough memory to hold all of the RDD's data. It avoids serialization and deserialization overhead, every subsequent operator on the RDD works on data held purely in memory with no reads from disk files, and no replica of the data has to be made and shipped to other nodes. In a real production environment, however, the scenarios where this level can be used directly are limited: if the RDD holds a lot of data (for example, billions of records), persisting it at this level can cause a JVM out-of-memory (OOM) exception.
- If MEMORY_ONLY causes a memory overflow, try MEMORY_ONLY_SER. This level serializes the RDD's data before storing it in memory, so each partition is just a byte array, which greatly reduces the number of objects and the memory footprint. The extra cost compared with MEMORY_ONLY is mainly serialization and deserialization. Subsequent operators still work purely in memory, so overall performance remains relatively high. The same risk applies as above: if the RDD holds too much data, an OOM exception can still occur.
- If neither pure-memory level is usable, prefer MEMORY_AND_DISK_SER over MEMORY_AND_DISK. Reaching this step means the RDD is too large to fit entirely in memory. Serialized data is smaller, which saves both memory and disk space. This level still caches as much data as possible in memory and writes to disk only what does not fit.
- DISK_ONLY and the levels with the _2 suffix are generally not recommended. DISK_ONLY reads and writes all data through disk files, which degrades performance so sharply that recomputing the RDD is sometimes faster. The _2 levels keep a replica of all data and send it to other nodes; the replication and network transfer add significant overhead, so avoid them unless the job requires high availability.
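The guidelines above can be condensed into a small decision helper. This is only a sketch under stated assumptions: the function name `choose_storage_level`, its parameters, and the 50% `headroom` factor are hypothetical and not part of Spark, and the size estimates must come from your own measurements. The strings it returns are the names of the Scala/Java `StorageLevel` constants; PySpark always pickles stored objects, so it may not expose the `_SER` variants.

```python
def choose_storage_level(deserialized_bytes, serialized_bytes,
                         cache_memory_bytes, headroom=0.5, replicate=False):
    """Suggest a storage level name following the guidelines above.

    All arguments are caller-supplied estimates (hypothetical inputs):
    - deserialized_bytes: size of the data as plain objects in memory
    - serialized_bytes:   size of the same data after serialization
    - cache_memory_bytes: memory available for caching across executors
    - headroom:           fraction of that memory we allow the cache to use
    - replicate:          True only if the job needs high availability
    """
    budget = cache_memory_bytes * headroom

    if deserialized_bytes <= budget:
        # Plenty of memory: no serialization cost at all.
        level = "MEMORY_ONLY"
    elif serialized_bytes <= budget:
        # Fits only as byte arrays: pay serialization, stay in memory.
        level = "MEMORY_ONLY_SER"
    else:
        # Too large for memory: keep what fits, spill the rest to disk.
        level = "MEMORY_AND_DISK_SER"

    # The _2 variants replicate each partition to a second node.
    return level + "_2" if replicate else level


GB = 1024 ** 3
small = choose_storage_level(1 * GB, 0.4 * GB, 4 * GB)
medium = choose_storage_level(3 * GB, 1 * GB, 4 * GB)
large = choose_storage_level(30 * GB, 10 * GB, 4 * GB)
large_ha = choose_storage_level(30 * GB, 10 * GB, 4 * GB, replicate=True)
print(small, medium, large, large_ha)
```

In a Scala or Java job the returned name could be resolved with `StorageLevel.fromString` and passed to `persist`. The helper never returns DISK_ONLY, which matches the advice that recomputation is often the better choice at that point.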
Conclusion