Spark Tuning (III): persistence reduces secondary queries
2022-07-07 16:28:00 【InfoQ】
1. Cause
2. Optimization
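# sc is assumed to be a SparkSession; sql and temp_table_name are defined elsewhere
df = sc.sql(sql)
df1 = df.persist()  # keep the query result so later queries reuse it
df1.createOrReplaceTempView(temp_table_name)
subdf = sc.sql(f"select * from {temp_table_name}")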
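
Why persisting helps can be sketched without Spark at all. The toy class below is a hypothetical stand-in, not a Spark API: it mimics a lazily evaluated result that re-runs its query on every action unless it has been persisted. The names `LazyResult`, `expensive_query`, and `runs` are assumptions made for illustration only.

```python
class LazyResult:
    """Toy stand-in for a lazily evaluated query result (not a Spark API)."""

    def __init__(self, compute):
        self._compute = compute   # function that runs the expensive "query"
        self._cached = None       # rows kept after the first run, if persisted
        self._persisted = False
        self.runs = 0             # how many times the query actually ran

    def persist(self):
        # Mark the result for caching; nothing is computed yet.
        self._persisted = True
        return self

    def collect(self):
        # Serve from the cache when a persisted result is already available.
        if self._persisted and self._cached is not None:
            return self._cached
        self.runs += 1
        rows = self._compute()
        if self._persisted:
            self._cached = rows
        return rows


def expensive_query():
    # Placeholder for a costly SQL query.
    return [1, 2, 3]


plain = LazyResult(expensive_query)
plain.collect()
plain.collect()

cached = LazyResult(expensive_query).persist()
cached.collect()
cached.collect()

print(plain.runs, cached.runs)  # 2 1
```

The unpersisted result runs the query once per action, while the persisted one runs it once in total; this is the secondary query that `persist()` is meant to remove.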

- MEMORY_ONLY, the default level for RDDs, gives the best performance, but only if memory is large enough to hold all of the RDD's data. It avoids serialization and deserialization overhead; subsequent operators on the RDD work on data held entirely in memory, with no need to read from disk files; and no copy of the data has to be made and sent to other nodes. In a real production environment, however, the scenarios where this level can be used directly are limited: if the RDD holds a lot of data (for example, billions of records), persisting at this level can cause a JVM out-of-memory (OOM) error.
- If MEMORY_ONLY causes an out-of-memory error, try MEMORY_ONLY_SER. This level serializes the RDD's data before storing it in memory, so each partition is just a byte array, which greatly reduces the number of objects and the memory footprint. Its extra cost compared with MEMORY_ONLY is mainly serialization and deserialization. Subsequent operators still work on data in memory, so overall performance remains relatively high. As with MEMORY_ONLY, an RDD that holds too much data can still cause an OOM error.
- If neither memory-only level works, use MEMORY_AND_DISK_SER rather than MEMORY_AND_DISK. Reaching this step means the RDD is too large to fit in memory completely. Serialized data is smaller, which saves both memory and disk space. This level still caches as much data as possible in memory and writes to disk only what does not fit.
- DISK_ONLY and the levels with the _2 suffix are generally not recommended. With DISK_ONLY, all reads and writes go through disk files, which degrades performance sharply; sometimes recomputing the whole RDD is faster. The _2 levels make a replica of all data and send it to other nodes, and the replication and network transfer add significant overhead, so avoid them unless the job requires high availability.
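
The order of preference above can be sketched as a small decision helper. This is a minimal illustration, not part of Spark: the function `choose_storage_level` and its size parameters are hypothetical, and it assumes you already have rough estimates of the data's in-memory size, its serialized size, and the memory available for caching. It returns the level names used in this article; whether a given name is exposed depends on the Spark API and version in use, so treat the returned string as a recommendation to look up rather than a guaranteed constant.

```python
def choose_storage_level(raw_bytes, serialized_bytes, memory_bytes, replicate=False):
    """Pick a persistence level name following the order described above.

    raw_bytes        -- estimated size of the deserialized data (assumption)
    serialized_bytes -- estimated size of the serialized data (assumption)
    memory_bytes     -- memory available for caching (assumption)
    replicate        -- True only if the job needs high availability
    """
    if raw_bytes <= memory_bytes:
        # Everything fits as plain objects: no serialization cost at all.
        level = "MEMORY_ONLY"
    elif serialized_bytes <= memory_bytes:
        # Fits only as byte arrays: pay serialization cost, stay in memory.
        level = "MEMORY_ONLY_SER"
    else:
        # Does not fit even when serialized: spill the remainder to disk.
        level = "MEMORY_AND_DISK_SER"

    # Replicated (_2) levels add copy and network overhead, so they are opt-in.
    return level + "_2" if replicate else level


print(choose_storage_level(4, 2, 8))
print(choose_storage_level(10, 6, 8))
print(choose_storage_level(100, 60, 8))
```

DISK_ONLY is deliberately never returned, since the list above advises against it in general.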
Conclusion