Checkpoint of RDD in Spark
2022-07-06 21:44:00 【Big data Xiaochen】

The RDD checkpoint mechanism exists because the storage media supported by cache and persist (memory and disk) are 【easily lost】, whereas HDFS is 【highly available】 and 【fault tolerant】. Checkpointing therefore stores the RDD's data on HDFS.
So checkpoint also provides persistence, and on top of that it is 【safer and more reliable】.
Usage
Step 1: 【sc.setCheckpointDir("hdfs://node1:8020/output/ckp/6_checkpoint")】 // specify the HDFS directory
Step 2: 【rdd.checkpoint()】 // call it on an RDD that is used frequently later, or on a very important RDD
Example
Build directly on the previous case with a small modification: first specify the HDFS directory, then replace persist or cache with checkpoint.
# -*- coding:utf-8 -*-
# Desc: compute PV and UV from an Apache log, checkpointing the cleaned RDD to HDFS
import os
import re
import time
from pyspark import SparkConf, SparkContext

os.environ['SPARK_HOME'] = '/export/server/spark'
PYSPARK_PYTHON = "/root/anaconda3/bin/python3.8"
# When multiple Python versions exist, not specifying one is likely to cause an error
os.environ["PYSPARK_PYTHON"] = PYSPARK_PYTHON
os.environ["PYSPARK_DRIVER_PYTHON"] = PYSPARK_PYTHON

if __name__ == '__main__':
    # 1 Create the SparkContext and specify the HDFS checkpoint directory
    conf = SparkConf().setAppName("2_rdd_from_external").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    sc.setCheckpointDir("hdfs://node1:8020/output/ckp/6_checkpoint")
    # 2 Read the file
    rdd = sc.textFile("file:///export/pyworkspace/pyspark_sz26/pyspark-sparkcore-3.1.2/data/apache.log")
    # 3 Clean the data and extract the IP; \s+ matches one or more whitespace
    #   characters (space, tab, newline and so on)
    rdd2 = rdd.map(lambda line: re.split(r'\s+', line)[0])
    # Mark rdd2 for checkpointing; the first action below writes it to HDFS
    rdd2.checkpoint()
    rdd2.count()
    # 4 Compute PV and print it
    pv = rdd2.count()
    print('pv=', pv)
    # 5 Compute UV and print it
    uv = rdd2.distinct().count()
    print('uv=', uv)
    # Keep the application alive for a while so the job can be inspected
    time.sleep(600)
    sc.stop()
Differences between persistence and checkpoint
Location: persist or cache stores the RDD's data in 【memory】, on 【disk】, or in 【off-heap memory】, whereas the checkpoint mechanism saves the RDD's data on 【HDFS】.
Life cycle: when the application finishes, or when 【unpersist】 is called, data stored by persist or cache is cleared automatically. The contents of the checkpoint directory, however, are 【not】 removed automatically and have to be cleared manually.
Lineage: persist or cache 【does】 retain the RDD's lineage, so if the data of a partition is lost, it can be recomputed 【from the lineage dependencies】. Checkpoint 【does not】 need to retain the dependencies, because even if the data of a partition is lost or damaged, it can be read directly from the other 【2】 replicas on HDFS.