Persistence / caching of RDDs in Spark
2022-07-06 21:44:00 【Big data Xiaochen】

While transforming RDDs, you may want to reuse an intermediate RDD several times, for example to output it more than once. By default every Action triggers a job, and every job loads the data and recomputes it from scratch, which wastes time. If the RDD's data is persisted to a storage medium such as 【memory】, 【disk】 or 【off-heap memory】, the RDD is computed only once, which improves program performance.
Both cache and persist are 【lazy】 operators: nothing is stored until an 【Action】 operator runs (count is commonly used to trigger it). The RDD's data is then persisted to memory or disk, and later operations read it directly from there.
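The lazy behaviour described above can be illustrated without a cluster. The following is a minimal sketch in plain Python, not Spark code: `ToyRDD`, `load_ips` and `distinct_count` are hypothetical names invented for this illustration, and the class only mimics the idea that `cache()` marks a dataset while the first action actually stores it.

```python
class ToyRDD:
    """Toy stand-in for an RDD; NOT Spark's API, it only mimics lazy caching."""

    def __init__(self, compute):
        self._compute = compute      # function that rebuilds the data from scratch
        self._want_cache = False     # set by cache(), which does no work itself
        self._cached = None          # filled in by the first action after cache()
        self.runs = 0                # how many times the data was recomputed

    def cache(self):
        # Lazy: only marks the dataset for caching
        self._want_cache = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached      # reuse the stored copy
        self.runs += 1
        data = list(self._compute())
        if self._want_cache:
            self._cached = data
        return data

    def count(self):
        return len(self._materialize())

    def distinct_count(self):
        return len(set(self._materialize()))


def load_ips():
    # Stands in for reading and parsing a log file
    return ["10.0.0.1", "10.0.0.2", "10.0.0.1"]


plain = ToyRDD(load_ips)
plain.count()
plain.distinct_count()

cached = ToyRDD(load_ips).cache()
runs_after_cache_call = cached.runs  # still 0: cache() alone computes nothing
pv = cached.count()                  # first action computes and stores the data
uv = cached.distinct_count()         # served from the stored copy

print(plain.runs, runs_after_cache_call, cached.runs, pv, uv)
```

Without caching, the two actions recompute the data twice; with caching, the data is computed once, and only when the first action runs.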
The following three calls all persist to 【memory】 only:
rdd.persist()
rdd.cache()
rdd.persist(StorageLevel.MEMORY_ONLY)
More storage levels
rdd.persist(level : StorageLevel)
StorageLevel
_ONLY: store the data only in 【memory】 or only on 【disk】
_2: keep 【2】 replicas of the persisted data
_SER: 【serialize】 the RDD's elements into a more compact form that is easier to send over the network
MEMORY_AND_DISK_SER_2: store the data 【serialized】 in memory, spill to 【disk】 if 【memory】 is not enough, and keep 2 replicas.
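The suffix convention above can be made concrete with a small decoder. This is a sketch in plain Python: `decode_level` is a hypothetical helper, not part of Spark, and it only interprets the name as text. It does not cover OFF_HEAP, and which level constants are actually available depends on the Spark version and language API in use.

```python
def decode_level(name):
    """Hypothetical helper: split a storage level name into its parts."""
    parts = name.split("_")
    replication = 1
    if parts[-1].isdigit():          # trailing _2 means two replicas
        replication = int(parts.pop())
    serialized = False
    if parts[-1] == "SER":           # _SER means elements are stored serialized
        serialized = True
        parts.pop()
    return {
        "memory": "MEMORY" in parts,
        "disk": "DISK" in parts,
        "serialized": serialized,
        "replication": replication,
    }


level = decode_level("MEMORY_AND_DISK_SER_2")
print(level)
print(decode_level("MEMORY_ONLY"))
```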
Releasing the cache / persistence
When a cached RDD's data is no longer needed, consider releasing its resources:
rdd.unpersist()
# -*- coding:utf-8 -*-
# Desc: count PV and UV from an Apache log, caching the RDD that is reused
import os
import re
import time
from pyspark import SparkConf, SparkContext

os.environ['SPARK_HOME'] = '/export/server/spark'
PYSPARK_PYTHON = "/root/anaconda3/bin/python3.8"
# When several Python versions are installed, not specifying one is likely to cause an error
os.environ["PYSPARK_PYTHON"] = PYSPARK_PYTHON
os.environ["PYSPARK_DRIVER_PYTHON"] = PYSPARK_PYTHON

if __name__ == '__main__':
    # 1 Create the SparkContext object
    conf = SparkConf().setAppName("2_rdd_from_external").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    # 2 Read the file
    rdd = sc.textFile("file:///export/pyworkspace/pyspark_sz26/pyspark-sparkcore-3.1.2/data/apache.log")
    # 3 Clean the data and extract the ip; \s+ matches runs of whitespace such as spaces, tabs and line breaks
    rdd2 = rdd.map(lambda line: re.split(r'\s+', line)[0])
    # Mark rdd2 for caching (lazy), then trigger the caching with an Action
    rdd2.cache()
    rdd2.count()
    # 4 Compute pv and print it
    pv = rdd2.count()
    print('pv=', pv)
    # 5 Compute uv and print it
    uv = rdd2.distinct().count()
    print('uv=', uv)
    # Keep the application alive for 10 minutes (e.g. to inspect the job in the Spark web UI)
    time.sleep(600)
    sc.stop()
When should you use cache/persist?
When an RDD is reused 【multiple】 times.
When the computation that produces the RDD is 【complex and expensive】 (for example, the data is fetched through 【JDBC】) and the RDD is used many times.