Spark Tuning (III): persistence reduces secondary queries
2022-07-07 16:28:00 【InfoQ】
1. Cause
2. Optimization
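# sc is assumed to be a SparkSession; sql and temp_table_name are defined elsewhere
df = sc.sql(sql)
df1 = df.persist()  # keep the query result so later queries do not recompute it
df1.createOrReplaceTempView(temp_table_name)
subdf = sc.sql(f"select * from {temp_table_name}")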

- As a first choice, MEMORY_ONLY gives the highest performance, but only if you have enough memory to comfortably hold all of the RDD's data. Because no serialization or deserialization takes place, that overhead is avoided. Subsequent operators on the RDD work entirely on in-memory data and never read from disk files, so performance is high. There is also no need to make a copy of the data and transfer it to other nodes. Note, however, that in real production environments the scenarios where this strategy can be used directly are limited: if the RDD holds a lot of data (for example, billions of records), using this level directly can cause a JVM OOM (out-of-memory) exception.
- If MEMORY_ONLY causes memory overflow, try MEMORY_ONLY_SER. This level serializes the RDD's data before storing it in memory, so each partition is just a byte array, which greatly reduces the number of objects and the memory footprint. Its extra cost over MEMORY_ONLY is mainly serialization and deserialization. Subsequent operators still work purely in memory, so overall performance remains relatively high. The same risk applies as above: if the RDD holds too much data, an OOM exception is still possible.
- If neither pure-memory level is usable, use MEMORY_AND_DISK_SER rather than MEMORY_AND_DISK. Having reached this step means the RDD is large and memory cannot hold all of it. Serialized data is smaller, which saves both memory and disk space. This strategy still caches as much data as possible in memory and writes to disk only what does not fit.
- DISK_ONLY and the levels with the _2 suffix are generally not recommended. With DISK_ONLY, data is read and written entirely through disk files, which can cause a dramatic performance drop; sometimes it is faster to recompute the whole RDD. The _2 levels must keep a replica of all data and send it to other nodes. The data replication and network transfer cause significant overhead, so avoid these levels unless the job requires high availability.
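The selection order described in the list above can be sketched as a small decision helper. This is only an illustration, not part of the original tuning procedure: the function name choose_storage_level, its parameters, and the serialized_ratio default are hypothetical, and real sizes should be measured rather than guessed. The function returns level names as plain strings so that it runs without a Spark installation.

```python
def choose_storage_level(estimated_bytes, free_memory_bytes,
                         serialized_ratio=0.5, need_replication=False):
    """Pick a persistence level name following the order in the list above.

    All parameters are hypothetical inputs for illustration:
    estimated_bytes   -- estimated in-memory size of the deserialized data
    free_memory_bytes -- memory available for caching
    serialized_ratio  -- assumed size of serialized data relative to deserialized
    need_replication  -- True only if the job requires high availability
    """
    if estimated_bytes <= free_memory_bytes:
        # Everything fits deserialized: fastest option.
        level = "MEMORY_ONLY"
    elif estimated_bytes * serialized_ratio <= free_memory_bytes:
        # Fits only when serialized: trade CPU for memory.
        level = "MEMORY_ONLY_SER"
    else:
        # Does not fit even serialized: spill the remainder to disk.
        level = "MEMORY_AND_DISK_SER"
    if need_replication:
        # Replicated variants carry the _2 suffix.
        level += "_2"
    return level


print(choose_storage_level(10 * 1024**3, 64 * 1024**3))
print(choose_storage_level(100 * 1024**3, 64 * 1024**3))
print(choose_storage_level(500 * 1024**3, 64 * 1024**3))
```

The _SER names are constants of the Scala and Java APIs. In PySpark, stored objects are always serialized, so the nearest equivalents there are StorageLevel.MEMORY_ONLY and StorageLevel.MEMORY_AND_DISK, passed as the argument to persist().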
Conclusion