Spark Tuning (III): Persistence Reduces Repeated Queries
2022-07-07 16:28:00 【InfoQ】
1. Cause
2. Optimization
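# `sc` is assumed to be a SparkSession; `sql` and `temp_table_name` are defined elsewhere.
df = sc.sql(sql)
# Persist the result so later queries reuse it instead of re-running `sql`.
df1 = df.persist()
df1.createOrReplaceTempView(temp_table_name)
# Query the persisted view by name (the SQL text must be a string).
subdf = sc.sql(f"select * from {temp_table_name}")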
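The effect of persisting can be illustrated with a toy model in plain Python. This is only a sketch and does not use Spark: `LazyQuery`, `expensive_query`, and the `runs` counter are hypothetical names made up for illustration. It assumes the usual lazy-evaluation behavior, where a result that is not persisted is recomputed every time it is consumed.

```python
class LazyQuery:
    """Toy stand-in for a lazily evaluated DataFrame (not the Spark API)."""

    def __init__(self, compute):
        self._compute = compute   # function that runs the expensive query
        self._cached = None       # rows kept after the first run, if persisted
        self._persisted = False
        self.runs = 0             # how many times the query really executed

    def persist(self):
        # Only marks the result for caching; nothing is computed yet.
        self._persisted = True
        return self

    def collect(self):
        # An "action": returns the rows, recomputing unless a cached copy exists.
        if self._persisted and self._cached is not None:
            return self._cached
        self.runs += 1
        rows = self._compute()
        if self._persisted:
            self._cached = rows
        return rows


def expensive_query():
    # Placeholder for the costly SQL query.
    return [("a", 1), ("b", 2)]


plain = LazyQuery(expensive_query)
plain.collect()
plain.collect()

cached = LazyQuery(expensive_query).persist()
cached.collect()
cached.collect()

print(plain.runs, cached.runs)  # 2 1
```

Without `persist()` the query runs once per use; with it, the second use is served from the cached rows.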

- By default, MEMORY_ONLY gives the best performance, but only if memory is large enough to hold all of the RDD's data with room to spare. There is no serialization or deserialization, so that overhead is avoided; every subsequent operator on the RDD works on data held purely in memory, with no reads from disk files; and no replica of the data has to be made and shipped to other nodes. In real production environments, however, the cases where this level can be used directly are limited: when the RDD holds a lot of data (for example, billions of records), using this level directly can cause a JVM OOM (out-of-memory) error.
- If MEMORY_ONLY runs out of memory, try MEMORY_ONLY_SER. This level serializes the RDD's data before storing it in memory, so each partition is just a byte array, which greatly reduces the number of objects and the memory footprint. The extra cost compared with MEMORY_ONLY is mainly serialization and deserialization. Subsequent operators still work purely in memory, so overall performance remains fairly high. The same risk applies as above: if the RDD holds too much data, an OOM error can still occur.
- If neither pure-memory level works, use MEMORY_AND_DISK_SER rather than MEMORY_AND_DISK. Reaching this step means the RDD is too large to fit entirely in memory. Serialized data is smaller, which saves both memory and disk space. This level still caches as much data as possible in memory and writes to disk only what does not fit.
- DISK_ONLY and the levels with the _2 suffix are generally not recommended. DISK_ONLY reads and writes all data through disk files, which degrades performance sharply; sometimes recomputing the RDD is faster. The _2 levels make a replica of all data and send it to other nodes, and that replication and network transfer is costly, so avoid them unless the job requires high availability.
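The selection order in the list above can be written down as a small decision helper. This is a sketch, not part of Spark: `choose_storage_level` and its parameters are hypothetical, and it returns the level name as a plain string rather than a Spark constant, because which constants are available depends on the API and Spark version you use (check that against your own environment).

```python
def choose_storage_level(fits_deserialized, fits_serialized, need_replica=False):
    """Return the name of the persistence level suggested by the list above.

    fits_deserialized: the whole dataset fits in memory as plain objects.
    fits_serialized:   it fits in memory only after serialization.
    need_replica:      the job needs high availability (the "_2" levels).
    """
    if fits_deserialized:
        level = "MEMORY_ONLY"          # fastest: no (de)serialization, no disk
    elif fits_serialized:
        level = "MEMORY_ONLY_SER"      # byte arrays: less memory, some CPU cost
    else:
        level = "MEMORY_AND_DISK_SER"  # spill what does not fit to disk
    # Replication is costly, so the "_2" suffix is added only on request.
    return level + "_2" if need_replica else level


print(choose_storage_level(True, True))                       # MEMORY_ONLY
print(choose_storage_level(False, True))                      # MEMORY_ONLY_SER
print(choose_storage_level(False, False))                     # MEMORY_AND_DISK_SER
print(choose_storage_level(False, False, need_replica=True))  # MEMORY_AND_DISK_SER_2
```

DISK_ONLY is deliberately never returned, following the advice that recomputing is often cheaper than reading everything back from disk.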
Conclusion