Spark Tuning (III): Persistence Reduces Repeated Queries
2022-07-07 16:28:00 【InfoQ】
1. Cause
2. Optimization
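# `sc` is the SparkSession; `sql` and `temp_table_name` are strings defined elsewhere
df = sc.sql(sql)
# persist() marks the result for caching so that later queries can reuse it
df1 = df.persist()
df1.createOrReplaceTempView(temp_table_name)
# Follow-up queries read the temp view instead of re-running the original SQL
subdf = sc.sql(f"select * from {temp_table_name}")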
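The snippet above only marks the DataFrame for caching: `persist()` is lazy, so nothing is stored until an action runs, and the cached blocks stay allocated until they are released. A minimal sketch of the full lifecycle follows. It assumes `spark` is a `pyspark.sql.SparkSession` (the original calls it `sc`); the helper names `cache_query_as_view` and `release` are hypothetical and not part of Spark.

```python
def cache_query_as_view(spark, sql, view_name):
    """Run `sql`, cache its result, and expose it as a temporary view.

    `spark` is assumed to be a pyspark.sql.SparkSession. This helper is
    illustrative only; it is not a Spark API.
    """
    df = spark.sql(sql).persist()          # lazy: only marks the result for caching
    df.count()                             # an action forces the cache to be filled
    df.createOrReplaceTempView(view_name)  # later SQL can refer to the result by name
    return df


def release(df):
    """Free the cached blocks once the follow-up queries are finished."""
    df.unpersist()


# Usage with a real session (requires pyspark to be installed):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   cached = cache_query_as_view(spark, "select 1 as id", "tmp_ids")
#   spark.sql("select * from tmp_ids").show()  # should reuse the cached result
#   release(cached)
```

Calling `count()` right after `persist()` costs one full pass over the data, so it is worth doing only when the result will be queried more than once.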
- By default, MEMORY_ONLY gives the highest performance, but only if memory is large enough to hold all of the RDD's data with room to spare. No serialization or deserialization takes place, so that overhead is avoided; subsequent operators on the RDD work on data held entirely in memory and never read from disk files; and no replica of the data has to be made and sent to other nodes. In real production environments, however, the scenarios where this level can be used directly are limited: if the RDD holds a lot of data (for example, billions of records), using this level directly can cause a JVM OOM (out-of-memory) error.
- If MEMORY_ONLY causes memory overflow, try MEMORY_ONLY_SER. This level serializes the RDD's data before storing it in memory, so each partition is just a byte array, which greatly reduces the number of objects and lowers memory usage. Its extra cost compared with MEMORY_ONLY is mainly serialization and deserialization. Subsequent operators still work purely in memory, so overall performance remains fairly high. The same risk applies as above: if the RDD holds too much data, an OOM error can still occur.
- If neither pure-memory level is usable, use MEMORY_AND_DISK_SER rather than MEMORY_AND_DISK. Reaching this step means the RDD is too large to fit entirely in memory. Serialized data is smaller, which saves both memory and disk space. This level still caches as much data as possible in memory and writes to disk only what does not fit.
- DISK_ONLY and the levels with the _2 suffix are generally not recommended. Reading and writing data entirely through disk files causes a sharp drop in performance, and sometimes recomputing the whole RDD is faster. The _2 levels keep a replica of all data and send it to other nodes; the replication and network transfer add significant overhead, so avoid them unless the job requires high availability.
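The rules of thumb above can be sketched as a small decision function. This is only an illustration under stated assumptions: the function name `choose_storage_level` and its three size estimates are hypothetical inputs that the caller must supply, and Spark does not compute them here. The `_SER` names are the Scala/Java storage levels; per the Spark documentation, objects stored from Python are always serialized with pickle, so PySpark's `StorageLevel` does not offer separate `_SER` variants.

```python
GIB = 1024 ** 3


def choose_storage_level(deserialized_bytes, serialized_bytes, cache_memory_bytes):
    """Return the name of a storage level following the guidance above.

    All three arguments are caller-provided estimates (hypothetical inputs):
    the data size as plain objects, its size once serialized, and the memory
    available for caching.
    """
    if deserialized_bytes <= cache_memory_bytes:
        # Everything fits as plain objects: no serialization cost at all.
        return "MEMORY_ONLY"
    if serialized_bytes <= cache_memory_bytes:
        # Fits only as byte arrays: pay CPU for (de)serialization, stay in memory.
        return "MEMORY_ONLY_SER"
    # Too large even when serialized: keep what fits in memory, spill the rest.
    return "MEMORY_AND_DISK_SER"


if __name__ == "__main__":
    print(choose_storage_level(2 * GIB, 1 * GIB, 4 * GIB))    # MEMORY_ONLY
    print(choose_storage_level(6 * GIB, 3 * GIB, 4 * GIB))    # MEMORY_ONLY_SER
    print(choose_storage_level(20 * GIB, 10 * GIB, 4 * GIB))  # MEMORY_AND_DISK_SER
```

The function never returns DISK_ONLY or a `_2` level, matching the advice to reserve those for cases where recomputation is impractical or high availability is required.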
Conclusion