RDD checkpoint in Spark
2022-07-06 21:44:00 【Big data Xiaochen】

The RDD checkpoint mechanism: the storage media supported by cache and persist (memory and local disk) can easily lose data, whereas HDFS is highly available and fault tolerant, so checkpoint stores the RDD data on HDFS.
Checkpoint therefore also persists data, and it does so more safely and reliably.
Usage
Step 1: sc.setCheckpointDir("hdfs://node1:8020/output/ckp/6_checkpoint")  // specify the HDFS directory
Step 2: rdd.checkpoint()  // call it on an RDD that is reused frequently later, or on a very important RDD; the data is written when the first action on that RDD runs
Case study
This builds directly on the previous case with a small change: first specify the HDFS directory, then replace persist or cache with checkpoint.
# -*- coding:utf-8 -*-
# Desc: checkpoint an RDD to HDFS, then compute PV and UV from an Apache log
import os
import json
import re
import time
from pyspark import SparkConf, SparkContext, StorageLevel
os.environ['SPARK_HOME'] = '/export/server/spark'
PYSPARK_PYTHON = "/root/anaconda3/bin/python3.8"
# When several Python versions are installed, not specifying one is likely to cause errors
os.environ["PYSPARK_PYTHON"] = PYSPARK_PYTHON
os.environ["PYSPARK_DRIVER_PYTHON"] = PYSPARK_PYTHON
if __name__ == '__main__':
    # 1. Create the SparkContext
    conf = SparkConf().setAppName("2_rdd_from_external").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    # Specify the HDFS directory that will hold the checkpoint data
    sc.setCheckpointDir("hdfs://node1:8020/output/ckp/6_checkpoint")
    # 2. Read the file
    rdd = sc.textFile("file:///export/pyworkspace/pyspark_sz26/pyspark-sparkcore-3.1.2/data/apache.log")
    # 3. Clean the data and extract the ip; \s+ matches one or more whitespace
    #    characters, such as tab, space, or line break
    rdd2 = rdd.map(lambda line: re.split(r'\s+', line)[0])
    # Mark rdd2 to be checkpointed to HDFS
    rdd2.checkpoint()
    # checkpoint() is lazy: this first action runs the job and writes rdd2 to HDFS
    rdd2.count()
    # 4. Compute pv and print it
    pv = rdd2.count()
    print('pv=', pv)
    # 5. Compute uv and print it
    uv = rdd2.distinct().count()
    print('uv=', uv)
    # Keep the application alive for 10 minutes so the job can be inspected
    time.sleep(600)
    sc.stop()
Differences between persistence and checkpoint
Location: persist or cache stores the RDD data in memory, on disk, or in off-heap memory, whereas the checkpoint mechanism saves the RDD data on HDFS.
Life cycle: when the application finishes, or when unpersist is called, the data stored by persist or cache is cleared automatically. The contents of the checkpoint directory are not removed automatically and must be cleared manually.
Lineage: persist or cache retains the RDD's lineage, so if the data of a partition is lost it can be recomputed from the lineage dependencies. Checkpoint does not need to retain the dependencies, because even if the data of a partition is lost or damaged, it can be read directly from one of the other 2 replicas on HDFS (with the default replication factor of 3).