Checkpointing RDDs in Spark
2022-07-06 21:44:00 【Big data Xiaochen】
The RDD checkpoint mechanism exists because the storage media that cache and persist support, memory and local disk, are prone to 【data loss】, whereas HDFS offers 【high availability】 and 【fault tolerance】. Checkpointing therefore stores the RDD data on HDFS.
So checkpoint also persists data, and it does so in a 【safer and more reliable】 way.
Usage
Step 1: 【sc.setCheckpointDir("hdfs://node1:8020/output/ckp/6_checkpoint")】 // specify the HDFS directory
Step 2: 【rdd.checkpoint()】 // call it on an RDD that is reused frequently later, or on a very important RDD
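The two steps can be sketched as a small helper. This is a minimal sketch, not the article's own code: the names `extract_ip` and `checkpoint_ips` and their parameters are hypothetical, and `sc` is assumed to be an already created SparkContext. The extra `cache()` call is also an addition. `checkpoint()` is lazy and the data is written by a separate job once the first action runs, so without caching the RDD would be computed a second time from its source.

```python
import re


def extract_ip(line):
    """Return the first whitespace-separated field of a log line (the client IP)."""
    return re.split(r'\s+', line)[0]


def checkpoint_ips(sc, log_path, ckp_dir):
    """Checkpoint the IP column of a log file and return (pv, uv, checkpoint_file).

    `sc` is assumed to be a live SparkContext; `log_path` and `ckp_dir`
    are caller-supplied locations (hypothetical parameter names).
    """
    # Step 1: tell Spark where checkpoint data should be written
    sc.setCheckpointDir(ckp_dir)

    ips = sc.textFile(log_path).map(extract_ip)

    # Assumption beyond the article: cache first, so the separate job that
    # writes the checkpoint can reuse the already computed partitions.
    ips.cache()

    # Step 2: mark the RDD for checkpointing; nothing is written yet
    ips.checkpoint()

    pv = ips.count()             # first action: computes the RDD and writes the checkpoint
    uv = ips.distinct().count()  # later actions no longer go back to the log file

    # getCheckpointFile() returns the checkpoint path once the RDD is checkpointed
    return pv, uv, ips.getCheckpointFile()
```

With a running context, a call such as `checkpoint_ips(sc, "file:///.../apache.log", "hdfs://node1:8020/output/ckp/6_checkpoint")` would return the two counts together with the path that Spark wrote under the checkpoint directory.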
Example
Building directly on the previous example with a small change: first specify the HDFS directory, then replace persist or cache with checkpoint.
# -*- coding:utf-8 -*-
# Desc: count PV and UV from an Apache log, checkpointing the intermediate RDD to HDFS
import os
import re
import time
from pyspark import SparkConf, SparkContext

os.environ['SPARK_HOME'] = '/export/server/spark'
PYSPARK_PYTHON = "/root/anaconda3/bin/python3.8"
# When multiple Python versions are installed, leaving this unset is likely to cause errors
os.environ["PYSPARK_PYTHON"] = PYSPARK_PYTHON
os.environ["PYSPARK_DRIVER_PYTHON"] = PYSPARK_PYTHON

if __name__ == '__main__':
    # 1 - Create the SparkContext
    conf = SparkConf().setAppName("2_rdd_from_external").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    # Specify the HDFS directory that will hold the checkpoint data
    sc.setCheckpointDir("hdfs://node1:8020/output/ckp/6_checkpoint")
    # 2 - Read the file
    rdd = sc.textFile("file:///export/pyworkspace/pyspark_sz26/pyspark-sparkcore-3.1.2/data/apache.log")
    # 3 - Clean the data and extract the IP; \s+ matches runs of whitespace
    #     such as spaces, tabs (\t), carriage returns (\r) and newlines
    rdd2 = rdd.map(lambda line: re.split(r'\s+', line)[0])
    # Mark rdd2 for checkpointing to HDFS; the data is written when the first action runs
    rdd2.checkpoint()
    rdd2.count()  # first action: triggers the job that writes the checkpoint
    # 4 - Compute PV and print it
    pv = rdd2.count()
    print('pv=', pv)
    # 5 - Compute UV and print it
    uv = rdd2.distinct().count()
    print('uv=', uv)
    # Keep the application alive for 10 minutes (presumably to inspect the Spark web UI)
    time.sleep(600)
    sc.stop()
Differences between persistence and checkpoint
Storage location: persist or cache stores the RDD data in 【memory】, on 【disk】, or in 【off-heap memory】, whereas the checkpoint mechanism saves the RDD data on 【HDFS】.
Life cycle: when the application finishes, or when 【unpersist】 is called, the data stored by persist or cache is cleared automatically. The contents of the checkpoint directory are 【not】 removed automatically and must be cleared manually.
Lineage: persist or cache 【does】 retain the RDD's lineage, so if the data of a partition is lost, it can be recomputed from its 【lineage dependencies】. checkpoint does 【not】 need to retain the dependencies, because even if the data of a partition is lost or damaged, it can be read directly from one of the 【2】 additional replicas that HDFS keeps (with the default replication factor of 3).
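The lineage point can be observed with `toDebugString()`, which describes an RDD's dependency chain. The sketch below is illustrative only: `as_text` and `lineage_before_after` are hypothetical helpers, `sc` is assumed to be an existing SparkContext, `ckp_dir` is any writable checkpoint location, and the small in-memory dataset stands in for real data.

```python
def as_text(value):
    """Decode bytes to str; PySpark's toDebugString() may return bytes."""
    return value.decode("utf-8") if isinstance(value, bytes) else value


def lineage_before_after(sc, ckp_dir):
    """Return the lineage description of an RDD before and after checkpointing."""
    sc.setCheckpointDir(ckp_dir)

    # A short chain of transformations so there is some lineage to cut
    rdd = sc.parallelize(range(10)).map(lambda x: x * 2).filter(lambda x: x > 4)

    before = as_text(rdd.toDebugString())  # lineage back to the source data

    rdd.checkpoint()  # lazy: only marks the RDD
    rdd.count()       # action: writes the checkpoint files

    after = as_text(rdd.toDebugString())   # lineage as seen after the checkpoint
    return before, after
```

Printing the two strings side by side should show the original chain in the first and a checkpoint-backed RDD in its place in the second. The exact text depends on the Spark version, so no sample output is given here.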