Checkpoint of RDD in Spark
2022-07-06 21:44:00 【Big data Xiaochen】

The RDD checkpoint mechanism: the storage media that cache and persist support, memory and disk, are prone to 【data loss】, whereas HDFS offers 【high availability】 and 【fault tolerance】, so checkpoint stores the RDD data on HDFS.
Checkpoint therefore also provides persistence, with the added benefit of being more 【safe and reliable】.
Usage
Step 1: 【sc.setCheckpointDir("hdfs://node1:8020/output/ckp/6_checkpoint")】 // specify the HDFS directory
Step 2: 【rdd.checkpoint()】 // call this on an RDD that is reused frequently later, or that is very important
Case study
This builds directly on the previous case with a small modification: first specify the HDFS directory, then replace persist or cache with checkpoint.
# -*- coding:utf-8 -*-
# Desc: compute PV and UV from an Apache access log, checkpointing the intermediate RDD to HDFS
import os
import re
import time
from pyspark import SparkConf, SparkContext

os.environ['SPARK_HOME'] = '/export/server/spark'
PYSPARK_PYTHON = "/root/anaconda3/bin/python3.8"
# When several Python versions are installed, set these explicitly; otherwise errors are likely
os.environ["PYSPARK_PYTHON"] = PYSPARK_PYTHON
os.environ["PYSPARK_DRIVER_PYTHON"] = PYSPARK_PYTHON

if __name__ == '__main__':
    # 1 - Create the SparkContext
    conf = SparkConf().setAppName("2_rdd_from_external").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    # Specify the HDFS directory that will hold the checkpoint data
    sc.setCheckpointDir("hdfs://node1:8020/output/ckp/6_checkpoint")
    # 2 - Read the file
    rdd = sc.textFile("file:///export/pyworkspace/pyspark_sz26/pyspark-sparkcore-3.1.2/data/apache.log")
    # 3 - Clean the data and extract the IP; \s+ matches runs of whitespace such as spaces, tabs and line breaks
    rdd2 = rdd.map(lambda line: re.split(r'\s+', line)[0])
    # Mark rdd2 for checkpointing to HDFS; nothing is written until an action runs
    rdd2.checkpoint()
    # The first action runs the job and writes the checkpoint
    pv = rdd2.count()
    # 4 - Compute PV and print it; rdd2 is now backed by the checkpoint
    pv = rdd2.count()
    print('pv=', pv)
    # 5 - Compute UV and print it
    uv = rdd2.distinct().count()
    print('uv=', uv)
    # Keep the application alive for 10 minutes (for example, to inspect the Spark web UI)
    time.sleep(600)
    sc.stop()
Result:
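One way to confirm that the checkpoint actually took effect is to ask the RDD itself. The helper below is a minimal sketch: `checkpoint_status` is a hypothetical name that does not appear in the original case, and it relies only on `RDD.isCheckpointed()` and `RDD.getCheckpointFile()` from the PySpark RDD API.

```python
def checkpoint_status(rdd):
    """Describe the checkpoint state of an RDD.

    Hypothetical helper for illustration. It only calls
    rdd.isCheckpointed() and rdd.getCheckpointFile().
    """
    if rdd.isCheckpointed():
        # getCheckpointFile() returns the path the RDD was written to
        return "checkpointed at " + str(rdd.getCheckpointFile())
    return "not checkpointed yet"
```

In the case above, calling `checkpoint_status(rdd2)` right after `rdd2.checkpoint()` should report that nothing has been written yet, because checkpointing is lazy. After the first `rdd2.count()` it should report a path under the directory passed to `setCheckpointDir`.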

Differences between persistence and checkpoint
Location: persist or cache stores the RDD data in 【memory】, on 【disk】, or in 【off-heap memory】, whereas the checkpoint mechanism saves the RDD data on 【HDFS】.
Life cycle: when the application finishes, or when 【unpersist】 is called, the data stored by persist or cache is cleared automatically. The contents of the checkpoint directory are 【not】 removed automatically and must be deleted manually.
Lineage: persist or cache 【does】 retain the RDD's lineage, so if the data of a partition is lost, it can be recomputed from 【the lineage dependencies】. Checkpoint 【does not】 need to retain the dependencies, because even if the data of a partition is lost or damaged, it can be read directly from the other 【2】 replicas on HDFS (HDFS keeps 3 replicas of each block by default).
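The lineage difference can be illustrated with a toy model in plain Python. This is not Spark code and not how Spark is implemented: `ToyRDD`, its methods, and the dict standing in for HDFS are all hypothetical, and unlike Spark the toy writes the checkpoint eagerly. It only shows that a cached dataset keeps its parent link for recomputation, while a checkpointed one drops it and reads from reliable storage instead.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, NOT Spark's implementation)."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source   # raw rows; set on a root node or after checkpoint
        self.parent = parent   # lineage: the ToyRDD this one was derived from
        self.fn = fn           # per-row transformation applied to the parent
        self.cached = None     # in-memory copy, which can be lost

    def map(self, fn):
        return ToyRDD(parent=self, fn=fn)

    def compute(self):
        if self.cached is not None:
            return list(self.cached)
        if self.parent is None:
            return list(self.source)
        # No stored copy available: recompute from the parent via lineage
        return [self.fn(row) for row in self.parent.compute()]

    def cache(self):
        # Keep an in-memory copy; the lineage (parent, fn) is retained
        self.cached = self.compute()
        return self

    def checkpoint(self, reliable_store, key):
        # Write to "reliable storage", then cut the lineage
        reliable_store[key] = self.compute()
        self.source = reliable_store[key]
        self.parent = None
        self.fn = None
        return self

    def lineage_depth(self):
        depth, node = 0, self
        while node.parent is not None:
            depth, node = depth + 1, node.parent
        return depth


logs = ToyRDD(source=["1.1.1.1 GET /", "2.2.2.2 GET /a", "1.1.1.1 GET /b"])

cached_ips = logs.map(lambda line: line.split()[0]).cache()
cached_ips.cached = None           # simulate losing the cached copy
recovered = cached_ips.compute()   # rebuilt from the parent via lineage

hdfs_like_store = {}               # hypothetical stand-in for HDFS
ckpt_ips = logs.map(lambda line: line.split()[0]).checkpoint(hdfs_like_store, "ips")

print(cached_ips.lineage_depth(), ckpt_ips.lineage_depth())  # prints: 1 0
```

The cached dataset still has one parent to fall back on, while the checkpointed one has none and depends entirely on the stored copy, which is why the reliability of the storage layer matters for checkpoint.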