Persistence / caching of RDDs in Spark
2022-07-06 21:44:00 【Big data Xiaochen】
While transforming RDDs, you may want to reuse an intermediate RDD several times, for example to output it more than once. By default, every action triggers a job, and every job loads the data and recomputes the whole lineage from scratch, which wastes time. If that RDD's data is persisted to a storage medium such as 【memory】, 【disk】, or 【off-heap memory】, the RDD is computed only once, which improves program performance.
Both cache and persist are 【lazy】 operators: nothing is stored until an 【action】 operator runs (count is commonly used as the trigger). At that point the RDD's data is persisted to memory or disk, and later operations read it directly from there.
The following 3 calls all persist to 【memory】 only:
rdd.persist()
rdd.cache()
rdd.persist(StorageLevel.MEMORY_ONLY)
More storage levels
rdd.persist(level : StorageLevel)
StorageLevel
_ONLY: store the data only in 【memory】 or only on 【disk】
_2: keep 【2】 replicas of the persisted data
_SER: store the RDD's elements in 【serialized】 form, which is more compact and convenient for network transfer
MEMORY_AND_DISK_SER_2: store the data 【serialized】 in memory, spill to 【disk】 when 【memory】 is not enough, and keep 2 replicas.
Releasing cached / persisted data
When a cached RDD is no longer needed, consider releasing its resources:
rdd.unpersist()
# -*- coding:utf-8 -*-
# Desc: count PV and UV from an Apache log, caching the reused RDD
import os
import re
import time
from pyspark import SparkConf, SparkContext

os.environ['SPARK_HOME'] = '/export/server/spark'
PYSPARK_PYTHON = "/root/anaconda3/bin/python3.8"
# When multiple Python versions are installed, not specifying one is likely to cause errors
os.environ["PYSPARK_PYTHON"] = PYSPARK_PYTHON
os.environ["PYSPARK_DRIVER_PYTHON"] = PYSPARK_PYTHON

if __name__ == '__main__':
    # 1 Create the SparkContext object
    conf = SparkConf().setAppName("2_rdd_from_external").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    # 2 Read the file
    rdd = sc.textFile("file:///export/pyworkspace/pyspark_sz26/pyspark-sparkcore-3.1.2/data/apache.log")
    # 3 Clean the data and extract the IP; \s+ matches runs of whitespace (space, tab, newline, ...)
    rdd2 = rdd.map(lambda line: re.split(r'\s+', line)[0])
    # Mark rdd2 for caching, then trigger the caching with an action
    rdd2.cache()
    rdd2.count()
    # 4 Compute PV and print it (now served from the cache)
    pv = rdd2.count()
    print('pv=', pv)
    # 5 Compute UV and print it
    uv = rdd2.distinct().count()
    print('uv=', uv)
    # Keep the application alive, presumably so the Spark web UI can be inspected
    time.sleep(600)
    sc.stop()
When should you use cache/persist?
When an RDD is reused 【multiple】 times.
When the computation that produces the RDD is 【complex and expensive】 (for example, the data is loaded over 【JDBC】) and the RDD is used more than once.