Checkpoint of RDD in Spark
2022-07-06 21:44:00 【Big data Xiaochen】
The RDD checkpoint mechanism: the storage media supported by cache and persist (memory and local disk) are prone to 【data loss】, whereas HDFS is 【highly available】 and 【fault tolerant】, so checkpoint stores the RDD data on HDFS.
checkpoint therefore also provides persistence, and in addition is 【safer and more reliable】.
Usage
Step 1: 【sc.setCheckpointDir("hdfs://node1:8020/output/ckp/6_checkpoint")】 // specify the HDFS directory
Step 2: 【rdd.checkpoint()】 // call it on an RDD that is used frequently later, or on a very important RDD
Case study
Building directly on the previous case with a small modification: first specify the HDFS directory, then replace persist or cache with checkpoint.
# -*- coding:utf-8 -*-
# Desc:This is Code Desc
import os
import json
import re
import time
from pyspark import SparkConf, SparkContext, StorageLevel

os.environ['SPARK_HOME'] = '/export/server/spark'
PYSPARK_PYTHON = "/root/anaconda3/bin/python3.8"
# When multiple Python versions are installed, not specifying one is likely to cause errors
os.environ["PYSPARK_PYTHON"] = PYSPARK_PYTHON
os.environ["PYSPARK_DRIVER_PYTHON"] = PYSPARK_PYTHON

if __name__ == '__main__':
    # 1 - Create the SparkContext and set the checkpoint directory on HDFS
    conf = SparkConf().setAppName("2_rdd_from_external").setMaster("local[*]")
    sc = SparkContext(conf=conf)
    sc.setCheckpointDir("hdfs://node1:8020/output/ckp/6_checkpoint")
    # 2 - Read the file
    rdd = sc.textFile("file:///export/pyworkspace/pyspark_sz26/pyspark-sparkcore-3.1.2/data/apache.log")
    # 3 - Clean the data and extract the IP; \s+ matches any run of whitespace
    #     (space, tab, line break, \r, \t)
    rdd2 = rdd.map(lambda line: re.split('\\s+', line)[0])
    # Mark rdd2 for checkpointing to HDFS; the first action below writes the checkpoint
    rdd2.checkpoint()
    pv = rdd2.count()
    # 4 - Compute pv and print it
    pv = rdd2.count()
    print('pv=', pv)
    # 5 - Compute uv and print it
    uv = rdd2.distinct().count()
    print('uv=', uv)
    # Keep the application alive for 10 minutes (e.g. to inspect the Spark web UI)
    time.sleep(600)
    sc.stop()
The difference between persistence and checkpoint
Location: persist or cache stores the RDD data in 【memory】, on 【disk】, or in 【off-heap memory】, whereas the checkpoint mechanism saves the RDD data on 【HDFS】.
Life cycle: when the application finishes, or when 【unpersist】 is called, the data stored by persist or cache is cleared automatically. The contents of the checkpoint directory, however, are 【not】 removed automatically and must be cleaned up manually.
Lineage: persist or cache 【does】 retain the RDD's lineage, so if the data of a partition is lost, it can be recomputed 【from its dependencies】. checkpoint, however, does 【not need】 to retain the dependencies, because even if the data of a partition is lost or damaged, it can be read directly from one of the other 【2】 replicas kept by HDFS.