Persistence / caching of RDDs in Spark
2022-07-06 21:44:00 【Big data Xiaochen】

When transforming RDDs, you may want to reuse an intermediate RDD several times, for example to output it more than once. By default, every Action triggers a job, and every job loads the data and recomputes it from scratch, which wastes time. If that RDD's data is persisted to a specific storage medium, such as 【memory】, 【disk】 or 【off-heap memory】, the RDD is computed only once, which improves program performance.
cache and persist are both 【lazy】 operators: they take effect only after an 【Action】 operator triggers them (count is commonly used for this). The RDD's data is then persisted to memory or disk, and later operations read it directly from memory or disk.
The following 3 calls all persist to 【memory】 only:
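The lazy behavior described above can be sketched without Spark. The ToyRDD class below is a hypothetical stand-in written for this example, not the Spark API; it only counts how often the data is rebuilt, assuming each action recomputes an uncached dataset from scratch.

```python
class ToyRDD:
    """Toy stand-in for an RDD (NOT the Spark API): lazy and recomputable."""

    def __init__(self, compute):
        self._compute = compute        # function that rebuilds the data from scratch
        self._cache_requested = False
        self._cached = None
        self.compute_calls = 0         # how many times the data was rebuilt

    def cache(self):
        # Lazy: only marks the dataset for caching, computes nothing yet.
        self._cache_requested = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached        # later actions read the stored copy
        self.compute_calls += 1
        data = list(self._compute())
        if self._cache_requested:
            self._cached = data        # the first action fills the cache
        return data

    def count(self):
        # An "action": forces the data to be produced.
        return len(self._materialize())

    def distinct_count(self):
        # Another "action" over the same data.
        return len(set(self._materialize()))


def load():
    # Made-up sample data for the illustration.
    return ["10.0.0.1", "10.0.0.2", "10.0.0.1"]


uncached = ToyRDD(load)
uncached.count()
uncached.distinct_count()
print(uncached.compute_calls)   # 2: every action recomputed the data

cached = ToyRDD(load).cache()
print(cached.compute_calls)     # 0: cache() alone computes nothing
cached.count()                  # first action computes and stores
cached.distinct_count()         # second action reuses the stored copy
print(cached.compute_calls)     # 1
```

In real Spark the same pattern is rdd.cache() followed by an action such as rdd.count().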
rdd.persist()
rdd.cache()
rdd.persist(StorageLevel.MEMORY_ONLY)
More storage levels
rdd.persist(level : StorageLevel)
StorageLevel
_ONLY: store the data only in 【memory】 or only on 【disk】
_2: keep 【2】 copies of the persisted data
_SER: 【serialize】 the RDD's elements, which makes them more compact and convenient for network transmission.
MEMORY_AND_DISK_SER_2: store the data 【serialized】 in memory; if 【memory】 is insufficient, spill the rest to 【disk】; keep 2 copies.
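The naming convention above can be read off mechanically. describe_level below is a hypothetical helper written for this example; it only parses the name string, does not call Spark, and covers only the suffixes listed above (names such as OFF_HEAP are not handled).

```python
def describe_level(name):
    """Decode a storage level name such as 'MEMORY_AND_DISK_SER_2'.

    Hypothetical helper for illustration: it parses the name only.
    """
    parts = name.split("_")
    replicas = 1
    if parts[-1].isdigit():          # trailing _2 means two copies
        replicas = int(parts.pop())
    ser_suffix = False
    if parts[-1] == "SER":           # _SER means elements are stored serialized
        ser_suffix = True
        parts.pop()
    if parts[-1] == "ONLY":          # _ONLY means a single storage medium
        parts.pop()
    base = "_".join(parts)           # 'MEMORY', 'DISK' or 'MEMORY_AND_DISK'
    return {
        "memory": "MEMORY" in base,
        "disk": "DISK" in base,
        "ser_suffix": ser_suffix,
        "replicas": replicas,
    }


print(describe_level("MEMORY_ONLY"))
print(describe_level("MEMORY_AND_DISK_SER_2"))
```

Which level names are actually available depends on the Spark version and language binding, so check the StorageLevel constants of the installed release before using one.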
Releasing the cache / persistence
When a cached RDD's data is no longer used, consider releasing the resources
rdd.unpersist()
# -*- coding:utf-8 -*-
# Desc: count PV and UV from an Apache log, caching the reused RDD
import os
import re
import time
from pyspark import SparkConf, SparkContext

os.environ['SPARK_HOME'] = '/export/server/spark'
PYSPARK_PYTHON = "/root/anaconda3/bin/python3.8"
# When multiple Python versions are installed, not specifying one is likely to cause errors
os.environ["PYSPARK_PYTHON"] = PYSPARK_PYTHON
os.environ["PYSPARK_DRIVER_PYTHON"] = PYSPARK_PYTHON

if __name__ == '__main__':
    # 1 Create the SparkContext object
    conf = SparkConf().setAppName("2_rdd_from_external").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    # 2 Read the file
    rdd = sc.textFile("file:///export/pyworkspace/pyspark_sz26/pyspark-sparkcore-3.1.2/data/apache.log")
    # 3 Clean the data and extract the IP; \s+ matches one or more whitespace characters (space, tab, newline, ...)
    rdd2 = rdd.map(lambda line: re.split(r'\s+', line)[0])
    # Cache rdd2, then run an action so that the cache is actually populated
    rdd2.cache()
    rdd2.count()
    # 4 Compute PV and print it
    pv = rdd2.count()
    print('pv=', pv)
    # 5 Compute UV and print it
    uv = rdd2.distinct().count()
    print('uv=', uv)
    # Keep the application alive for 10 minutes (for example, to inspect the job in the Spark web UI)
    time.sleep(600)
    sc.stop()
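To see what the extraction step and the two counts compute, here is the same logic in plain Python on three log lines. The lines are invented for this example and are not taken from apache.log; PV is taken as the number of lines and UV as the number of distinct IPs, as in the program above.

```python
import re

# Hypothetical log lines, invented for this example.
lines = [
    '192.168.0.1 - - [06/Jul/2022:21:44:00 +0800] "GET /index.html HTTP/1.1" 200 512',
    '192.168.0.2\t- - [06/Jul/2022:21:44:01 +0800] "GET /a.png HTTP/1.1" 200 1024',
    '192.168.0.1 - - [06/Jul/2022:21:44:02 +0800] "GET /b.css HTTP/1.1" 304 0',
]

# Same expression as in the map() step: split on whitespace, keep the first field.
ips = [re.split(r'\s+', line)[0] for line in lines]

pv = len(ips)        # page views: one per log line
uv = len(set(ips))   # unique visitors: distinct IPs

print('pv=', pv)  # pv= 3
print('uv=', uv)  # uv= 2
```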
When should you use cache/persist?
When an RDD is reused 【many】 times.
When the computation that produces the RDD is very 【complex and expensive】 (for example, the data is read over 【JDBC】) and the RDD is used many times.
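A rough way to reason about these two conditions is a back-of-the-envelope cost comparison. The function and the numbers below are assumptions made up for illustration (arbitrary cost units, at least one action, and no allowance for memory pressure or eviction); they are not measurements.

```python
def total_cost(compute_cost, read_cost, uses, cached):
    """Rough cost of running `uses` actions (uses >= 1) on one dataset."""
    if cached:
        # The first action computes and stores the data;
        # the remaining actions only read the stored copy.
        return compute_cost + read_cost * (uses - 1)
    # Without caching, every action recomputes from scratch.
    return compute_cost * uses


# Expensive computation (60 units) reused by 3 actions:
print(total_cost(60, 1, 3, cached=False))  # 180
print(total_cost(60, 1, 3, cached=True))   # 62

# Used only once: caching saves nothing.
print(total_cost(60, 1, 1, cached=False))  # 60
print(total_cost(60, 1, 1, cached=True))   # 60
```

The saving grows with both the cost of the computation and the number of reuses, which matches the two conditions above.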