Spark persistence strategy_ Cache optimization
2022-07-26 08:31:00 【StephenYYYou】
Contents
- Spark Persistence Strategy: Cache Optimization
- MEMORY_ONLY and MEMORY_AND_DISK

Spark Persistence Strategy: Cache Optimization
RDD persistence strategies
When an RDD needs to be reused frequently, Spark offers RDD persistence through two methods, persist() and cache(), as shown below:
// Scala
myRDD.persist()
myRDD.cache()

Why use persistence?
Because after RDD1 is transformed into a new RDD2 and an action runs, the original RDD1 is removed from memory. If a later operation needs RDD1 again, Spark walks back up the lineage, rereads the source data, recomputes RDD1, and only then continues the computation. That extra disk IO and recomputation is wasted work; persistence keeps the data around so the next action can use it directly.
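A minimal runnable sketch of this effect in local mode (the RDD contents and the squaring step are illustrative, not from the original article):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal local sketch (names are illustrative): the second action on a
// cached RDD reuses the in-memory partitions instead of recomputing them.
object PersistDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("persist-demo"))

    val rdd1 = sc.parallelize(1 to 1000)
      .map(n => n * n) // stand-in for an expensive transformation
    rdd1.cache()       // mark for MEMORY_ONLY persistence (lazy: nothing cached yet)

    val total = rdd1.sum()   // first action computes and caches rdd1
    val count = rdd1.count() // second action reads the cached partitions

    println(s"sum=$total count=$count")
    sc.stop()
  }
}
```

Note that persist()/cache() are lazy markers, not actions: the first action triggers both the computation and the caching, and only the second and later actions benefit.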
Source code of cache and persist
def persist(newLevel: StorageLevel): this.type = {
  if (isLocallyCheckpointed) {
    // This means the user previously called localCheckpoint(), which should have already
    // marked this RDD for persisting. Here we should override the old storage level with
    // one that is explicitly requested by the user (after adapting it to use disk).
    persist(LocalRDDCheckpointData.transformStorageLevel(newLevel), allowOverride = true)
  } else {
    persist(newLevel, allowOverride = false)
  }
}

/**
 * Persist this RDD with the default storage level (`MEMORY_ONLY`).
 */
def persist(): this.type = persist(StorageLevel.MEMORY_ONLY)

/**
 * Persist this RDD with the default storage level (`MEMORY_ONLY`).
 */
def cache(): this.type = persist()

From the source we can see that cache() is simply persist() called with no arguments, so it is enough to study persist(). The parameterless persist() defaults to StorageLevel.MEMORY_ONLY, so let's look at the source of the StorageLevel class.
/**
 * Various [[org.apache.spark.storage.StorageLevel]] defined and utility functions for creating
 * new storage levels.
 */
object StorageLevel {
  val NONE = new StorageLevel(false, false, false, false)
  val DISK_ONLY = new StorageLevel(true, false, false, false)
  val DISK_ONLY_2 = new StorageLevel(true, false, false, false, 2)
  val MEMORY_ONLY = new StorageLevel(false, true, false, true)
  val MEMORY_ONLY_2 = new StorageLevel(false, true, false, true, 2)
  val MEMORY_ONLY_SER = new StorageLevel(false, true, false, false)
  val MEMORY_ONLY_SER_2 = new StorageLevel(false, true, false, false, 2)
  val MEMORY_AND_DISK = new StorageLevel(true, true, false, true)
  val MEMORY_AND_DISK_2 = new StorageLevel(true, true, false, true, 2)
  val MEMORY_AND_DISK_SER = new StorageLevel(true, true, false, false)
  val MEMORY_AND_DISK_SER_2 = new StorageLevel(true, true, false, false, 2)
  val OFF_HEAP = new StorageLevel(true, true, true, false, 1)

  /**
   * :: DeveloperApi ::
   * Return the StorageLevel object with the specified name.
   */
  @DeveloperApi
  def fromString(s: String): StorageLevel = s match {
    case "NONE" => NONE
    case "DISK_ONLY" => DISK_ONLY
    case "DISK_ONLY_2" => DISK_ONLY_2
    case "MEMORY_ONLY" => MEMORY_ONLY
    case "MEMORY_ONLY_2" => MEMORY_ONLY_2
    case "MEMORY_ONLY_SER" => MEMORY_ONLY_SER
    case "MEMORY_ONLY_SER_2" => MEMORY_ONLY_SER_2
    case "MEMORY_AND_DISK" => MEMORY_AND_DISK
    case "MEMORY_AND_DISK_2" => MEMORY_AND_DISK_2
    case "MEMORY_AND_DISK_SER" => MEMORY_AND_DISK_SER
    case "MEMORY_AND_DISK_SER_2" => MEMORY_AND_DISK_SER_2
    case "OFF_HEAP" => OFF_HEAP
    case _ => throw new IllegalArgumentException(s"Invalid StorageLevel: $s")
  }

  /**
   * :: DeveloperApi ::
   * Create a new StorageLevel object.
   */
  @DeveloperApi
  def apply(
      useDisk: Boolean,
      useMemory: Boolean,
      useOffHeap: Boolean,
      deserialized: Boolean,
      replication: Int): StorageLevel = {
    getCachedStorageLevel(
      new StorageLevel(useDisk, useMemory, useOffHeap, deserialized, replication))
  }

  /**
   * :: DeveloperApi ::
   * Create a new StorageLevel object without setting useOffHeap.
   */
  @DeveloperApi
  def apply(
      useDisk: Boolean,
      useMemory: Boolean,
      deserialized: Boolean,
      replication: Int = 1): StorageLevel = {
    getCachedStorageLevel(new StorageLevel(useDisk, useMemory, false, deserialized, replication))
  }
}

As you can see, StorageLevel takes the following parameters:
| Parameter (default) | Meaning |
| --- | --- |
| useDisk: Boolean | whether to persist to disk |
| useMemory: Boolean | whether to persist in memory |
| useOffHeap: Boolean | whether to use off-heap memory |
| deserialized: Boolean | whether to store the data in deserialized form (false means serialized) |
| replication: Int = 1 | number of replicas (for fault tolerance) |
From this we get:
- NONE: the default (no persistence)
- DISK_ONLY: cached on disk only
- DISK_ONLY_2: cached on disk only, with 2 replicas
- MEMORY_ONLY: cached in memory only
- MEMORY_ONLY_2: cached in memory only, with 2 replicas
- MEMORY_ONLY_SER: cached in memory only, serialized
- MEMORY_ONLY_SER_2: cached in memory only, serialized, with 2 replicas
- MEMORY_AND_DISK: cached in memory; partitions that do not fit spill to disk
- MEMORY_AND_DISK_2: same as MEMORY_AND_DISK, with 2 replicas
- MEMORY_AND_DISK_SER: same as MEMORY_AND_DISK, serialized
- MEMORY_AND_DISK_SER_2: same as MEMORY_AND_DISK, serialized, with 2 replicas
- OFF_HEAP: cached in off-heap memory
Serialization works much like compression: it saves storage space but adds compute cost, because every use requires serializing and deserializing the data. The default replication is 1; extra replicas guard against data loss and improve fault tolerance. OFF_HEAP stores the RDD in Tachyon (now Alluxio), giving it lower garbage-collection overhead; knowing this much is enough. DISK_ONLY needs no further comment. Below we mainly compare MEMORY_ONLY and MEMORY_AND_DISK.
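Each named level is just a combination of the five constructor flags above, which can be inspected directly; a small sketch needing only spark-core on the classpath (the `describe` helper is illustrative):

```scala
import org.apache.spark.storage.StorageLevel

// Inspect the five flags behind a named storage level, e.g.
// MEMORY_AND_DISK_SER_2 = disk + memory, serialized, 2 replicas.
object LevelFlags {
  def describe(level: StorageLevel): String =
    s"disk=${level.useDisk} memory=${level.useMemory} offHeap=${level.useOffHeap} " +
      s"deserialized=${level.deserialized} replication=${level.replication}"

  def main(args: Array[String]): Unit = {
    println(describe(StorageLevel.MEMORY_ONLY))
    println(describe(StorageLevel.MEMORY_AND_DISK_SER_2))
  }
}
```

No SparkContext is needed here, since StorageLevel is a plain value class.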
MEMORY_ONLY and MEMORY_AND_DISK
MEMORY_ONLY: the RDD is cached in memory only; partitions that do not fit are reread from the source and recomputed each time they are used.
MEMORY_AND_DISK: as much as possible is kept in memory, and partitions that do not fit are written to disk, avoiding recomputation.
Intuitively, MEMORY_ONLY looks less efficient because it may have to recompute. In practice, however, the recomputation runs in memory and usually costs far less than disk IO, so MEMORY_ONLY is the common default. Only when the intermediate computation is particularly expensive is MEMORY_AND_DISK the better choice.
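A hedged sketch of that trade-off (the pipeline itself is illustrative): mark an expensive RDD as MEMORY_AND_DISK so partitions that do not fit in memory spill to disk rather than being recomputed, and release the cache with unpersist() once the RDD is no longer reused:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Illustrative sketch: choose MEMORY_AND_DISK when the transformation is
// costly enough that spilling to disk beats recomputing from the source.
object ChooseLevel {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[*]").setAppName("choose-level"))

    val expensive = sc.parallelize(1 to 100)
      .map(n => (n % 10, n.toLong * n)) // stand-in for a costly computation
      .persist(StorageLevel.MEMORY_AND_DISK)

    val grouped = expensive.reduceByKey(_ + _).collectAsMap() // action 1: computes and caches
    val count = expensive.count()                             // action 2: hits the cache

    println(s"groups=${grouped.size} count=$count")

    expensive.unpersist() // free the cached partitions when done
    sc.stop()
  }
}
```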
Summary
When an RDD is reused across actions, persist it with persist() or cache(); cache() is simply persist(StorageLevel.MEMORY_ONLY). MEMORY_ONLY is usually the right default, because in-memory recomputation tends to cost less than disk IO; switch to MEMORY_AND_DISK only when the intermediate computation is particularly expensive, and use the _SER variants to trade CPU time for memory space.