Persistence / caching of RDDs in Spark
2022-07-06 21:44:00 【Big data Xiaochen】
During a chain of RDD transformations, you may want to reuse an intermediate RDD several times, for example to output it more than once. By default every Action triggers a job, and every job loads the data and recomputes the RDD from scratch, which wastes time. If the RDD's data is persisted to a storage medium such as 【memory】, 【disk】, or 【off-heap memory】, the RDD is computed only once, which improves program performance.
cache and persist are both 【lazy】 operators: nothing is stored until an 【Action】 operator runs (count is commonly used as the trigger). At that point the RDD's data is persisted to memory or disk, and later operations read it directly from there.
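The two points above (recomputation per Action, and lazy caching) can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Spark code: `ToyRDD`, `load_lines`, and the `calls` counter are hypothetical names invented for the illustration.

```python
class ToyRDD:
    """Toy stand-in for an RDD: a lazy pipeline that is recomputed per action."""

    def __init__(self, compute):
        self._compute = compute    # zero-argument function that produces a list
        self._want_cache = False   # set by cache(); nothing is stored yet
        self._cached = None        # filled by the first action after cache()

    def map(self, fn):
        # Transformation: only builds a new lazy pipeline, computes nothing.
        return ToyRDD(lambda: [fn(x) for x in self._materialize()])

    def cache(self):
        # Lazy: only marks this dataset as "to be cached".
        self._want_cache = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached
        data = self._compute()
        if self._want_cache:
            self._cached = data
        return data

    def count(self):
        # Action: triggers the actual computation.
        return len(self._materialize())


calls = {"load": 0}  # how many times the source data was loaded


def load_lines():
    calls["load"] += 1
    return ["1.1.1.1 GET /a", "2.2.2.2 GET /b", "1.1.1.1 GET /c"]


# Without cache: every action reloads the source.
uncached = ToyRDD(load_lines).map(lambda line: line.split()[0])
uncached.count()
uncached.count()
loads_without_cache = calls["load"]

# With cache: cache() itself loads nothing; the first action fills the cache.
calls["load"] = 0
cached = ToyRDD(load_lines).map(lambda line: line.split()[0]).cache()
loads_after_cache_call = calls["load"]
cached.count()
cached.count()
loads_with_cache = calls["load"]

print(loads_without_cache, loads_after_cache_call, loads_with_cache)
```

In the toy model, two actions on the uncached pipeline load the source twice, while the cached pipeline loads it once, and only when the first action runs.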
The following 3 calls all persist to 【memory】 only:
rdd.persist()
rdd.cache()
rdd.persist(StorageLevel.MEMORY_ONLY)
More storage levels
rdd.persist(level : StorageLevel)
StorageLevel
_ONLY: store the data only in 【memory】 or only on 【disk】
_2: keep 【2】 replicas of the persisted data
_SER: store the RDD's elements in 【serialized】 form, which makes them more compact and easier to send over the network.
MEMORY_AND_DISK_SER_2: store the data 【serialized】 in memory; if 【memory】 is not enough, spill the rest to 【disk】; keep 2 replicas.
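The naming convention above can be read mechanically, suffix by suffix. The helper below is a hypothetical plain-Python sketch that decodes a level name into its parts; `describe_level` and `LevelInfo` are not part of Spark, and the function only interprets the name, it does not consult Spark's actual `StorageLevel` definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LevelInfo:
    use_memory: bool   # name contains MEMORY
    use_disk: bool     # name contains DISK
    ser_suffix: bool   # name carries the _SER suffix
    replication: int   # trailing _N, or 1 if absent


def describe_level(name: str) -> LevelInfo:
    """Decode a storage level name such as MEMORY_AND_DISK_SER_2."""
    parts = name.split("_")
    replication = 1
    if parts[-1].isdigit():            # trailing _2 means two replicas
        replication = int(parts.pop())
    ser_suffix = False
    if parts and parts[-1] == "SER":   # _SER means serialized storage
        ser_suffix = True
        parts.pop()
    return LevelInfo("MEMORY" in parts, "DISK" in parts, ser_suffix, replication)


for level in ("MEMORY_ONLY", "DISK_ONLY_2", "MEMORY_AND_DISK_SER_2"):
    print(level, describe_level(level))
```

Which level names are actually available depends on the Spark version and language binding, so check the `StorageLevel` documentation for the release in use before relying on a particular name.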
Releasing the cache / persistence
When a cached RDD's data is no longer needed, consider releasing the resources:
rdd.unpersist()
# -*- coding:utf-8 -*-
# Desc: cache an intermediate RDD, then compute PV and UV from it
import os
import re
import time
from pyspark import SparkConf, SparkContext

os.environ['SPARK_HOME'] = '/export/server/spark'
PYSPARK_PYTHON = "/root/anaconda3/bin/python3.8"
# When multiple Python versions are installed, not specifying one is likely to cause errors
os.environ["PYSPARK_PYTHON"] = PYSPARK_PYTHON
os.environ["PYSPARK_DRIVER_PYTHON"] = PYSPARK_PYTHON

if __name__ == '__main__':
    # 1 Create the SparkContext object
    conf = SparkConf().setAppName("2_rdd_from_external").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    # 2 Read the file
    rdd = sc.textFile("file:///export/pyworkspace/pyspark_sz26/pyspark-sparkcore-3.1.2/data/apache.log")
    # 3 Clean the data and extract the IP (first field); \s+ matches any run of
    #   whitespace such as spaces, tabs (\t), \r and line breaks
    rdd2 = rdd.map(lambda line: re.split(r'\s+', line)[0])
    # Mark rdd2 for caching; cache() is lazy, so this count() is the Action
    # that actually fills the cache
    rdd2.cache()
    rdd2.count()
    # 4 Compute PV and print it (served from the cache)
    pv = rdd2.count()
    print('pv=', pv)
    # 5 Compute UV and print it (also served from the cache)
    uv = rdd2.distinct().count()
    print('uv=', uv)
    # Keep the application alive for 10 minutes, e.g. to inspect the cached RDD in the Spark web UI
    time.sleep(600)
    sc.stop()
When should you use cache/persist?
When an RDD is reused 【multiple】 times.
When the computation that produces the RDD is 【complex and expensive】 (for example, the data is loaded over 【JDBC】) and the RDD is used multiple times.