Spark Tuning (III): persistence reduces secondary queries
2022-07-07 16:28:00 【InfoQ】
1. Cause
2. Starting the optimization
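# sc is assumed to be a SparkSession; sql and temp_table_name are strings defined elsewhere
df = sc.sql(sql)
# Mark the query result for caching so later queries reuse it instead of re-running sql
df1 = df.persist()
df1.createOrReplaceTempView(temp_table_name)
# The follow-up query reads from the persisted view
subdf = sc.sql(f"select * from {temp_table_name}")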
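Why persisting helps can be illustrated with a small pure-Python analogy. This is only a sketch, not Spark itself: `LazyQuery`, `expensive_query`, and the `runs` counter are hypothetical names made up for the illustration. It assumes the usual lazy-evaluation behaviour, where each action re-runs the source query unless the result has been persisted.

```python
class LazyQuery:
    """Toy stand-in for a lazily evaluated DataFrame (hypothetical, not a Spark API)."""

    def __init__(self, compute):
        self._compute = compute   # the expensive source query
        self._persisted = False
        self._cached = None
        self.runs = 0             # how many times the source query actually ran

    def persist(self):
        # Only marks the result for caching; nothing is computed yet.
        self._persisted = True
        return self

    def collect(self):
        # An "action": reuse the cached rows if they exist, otherwise recompute.
        if self._persisted and self._cached is not None:
            return self._cached
        self.runs += 1
        rows = self._compute()
        if self._persisted:
            self._cached = rows
        return rows


def expensive_query():
    # Stands in for the original SQL query.
    return [("a", 1), ("b", 2)]


uncached = LazyQuery(expensive_query)
uncached.collect()
uncached.collect()

cached = LazyQuery(expensive_query).persist()
cached.collect()
cached.collect()

print(uncached.runs, cached.runs)  # 2 1
```

Without persistence the source query runs once per action; with it, the second action is served from the cache.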
- MEMORY_ONLY, the default level for RDD persistence, naturally gives the best performance, but only if there is enough memory to hold all of the RDD's data. No serialization or deserialization takes place, so that overhead is avoided. Subsequent operators on the RDD work entirely on in-memory data and never read from disk files, and no replica of the data has to be made and shipped to other nodes. Note, however, that in real production environments the scenarios where this level can be used directly are limited: if the RDD holds a lot of data (for example, billions of records), using this level directly can cause a JVM OOM (out-of-memory) error.
- If MEMORY_ONLY causes memory overflow, try MEMORY_ONLY_SER. This level serializes the RDD data before storing it in memory, so each partition is just a byte array. This greatly reduces the number of objects and lowers memory consumption. The extra cost compared with MEMORY_ONLY is mainly serialization and deserialization, but subsequent operators still work purely in memory, so overall performance remains fairly high. The same risk applies as above: if the RDD holds too much data, an OOM error may still occur.
- If no memory-only level is workable, use MEMORY_AND_DISK_SER rather than MEMORY_AND_DISK. Reaching this step means the RDD is large and cannot fit entirely in memory. Serialized data is smaller, which saves both memory and disk space. This level also caches as much data in memory as it can and writes to disk only what does not fit.
- DISK_ONLY and the levels with the _2 suffix are generally not recommended. With DISK_ONLY, all reads and writes go through disk files, which can degrade performance dramatically; sometimes it is faster to recompute the whole RDD. The _2 levels replicate every piece of data and send the replica to another node, and this replication and network transfer adds significant overhead. Avoid them unless the job requires high availability.
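The guidance in the list above can be sketched as a small decision helper. This is a hypothetical illustration, not part of any Spark API: `choose_storage_level` and its parameters are invented for the example, and it returns level names as plain strings. The names follow the Scala/Java `StorageLevel` constants; depending on the version, PySpark may not expose the `_SER` variants because Python data is always stored serialized.

```python
def choose_storage_level(fits_in_memory, fits_serialized, needs_replica=False):
    """Pick a storage level name following the rules of thumb above.

    fits_in_memory:  the deserialized data fits in executor memory
    fits_serialized: the serialized data fits in executor memory
    needs_replica:   the job needs a second copy for high availability
    """
    if fits_in_memory:
        level = "MEMORY_ONLY"          # fastest, no serialization cost
    elif fits_serialized:
        level = "MEMORY_ONLY_SER"      # fewer objects, smaller footprint
    else:
        level = "MEMORY_AND_DISK_SER"  # spill what does not fit to disk
    if needs_replica:
        level += "_2"                  # replicated levels cost extra network and storage
    return level


print(choose_storage_level(True, True))    # MEMORY_ONLY
print(choose_storage_level(False, True))   # MEMORY_ONLY_SER
print(choose_storage_level(False, False))  # MEMORY_AND_DISK_SER
```

Whether the data "fits" is something you would have to estimate yourself, for example from the cached size shown in the Spark UI after a trial run.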
Conclusion