Hudi quick experience (including detailed operation steps and screenshots)
2022-07-03 09:25:00 【Did Xiao Hu get stronger today】
Hudi Quick experience
This example walks through the following process:
Hadoop, Spark, and Hudi (with their related components) must be installed in advance.
Spark installation tutorial:
https://blog.csdn.net/hshudoudou/article/details/125204028?spm=1001.2014.3001.5501
Hudi compilation and installation tutorial:
https://blog.csdn.net/hshudoudou/article/details/123881739?spm=1001.2014.3001.5501
Note: Hudi only manages data. It does not store data and does not analyze data.
Start spark-shell and add the jar packages
./spark-shell \
--master local[2] \
--jars /home/hty/hudi-jars/hudi-spark3-bundle_2.12-0.9.0.jar,\
/home/hty/hudi-jars/spark-avro_2.12-3.0.1.jar,/home/hty/hudi-jars/spark_unused-1.0.0.jar.jar \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"
You can see that all three jar packages were loaded successfully:
Import the packages and set the storage directory:
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
val tableName = "hudi_trips_cow"
val basePath = "hdfs://hadoop102:8020/datas/hudi-warehouse/hudi_trips_cow"
val dataGen = new DataGenerator
Generate simulated Trip ride data
val inserts = convertToStringList(dataGen.generateInserts(10))
3. Convert the simulated data List to a DataFrame
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
4. View the Schema of the converted DataFrame
5. Select the relevant fields and view the simulated sample data
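No code survives for this step in the text. A minimal sketch, assuming the `df` created in step 3 is still in scope in the same spark-shell session, would use Spark's standard `printSchema` call:

```scala
// Assumes df from step 3 is in scope in the same spark-shell session.
// Prints the column names and inferred types of the simulated Trip data.
df.printSchema()
```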
df.select("rider", "begin_lat", "begin_lon", "driver", "fare", "uuid", "ts").show(10, truncate=false)
Insert data
Save the simulated Trip data to a Hudi table. Because Hudi was originally built on the Spark framework, Spark SQL supports Hudi as a data source: specify the source with format, set the relevant properties, and save the data.
df.write
.mode(Overwrite)
.format("hudi")
.options(getQuickstartWriteConfigs)
.option(PRECOMBINE_FIELD_OPT_KEY, "ts")
.option(RECORDKEY_FIELD_OPT_KEY, "uuid")
.option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath")
.option(TABLE_NAME, tableName)
.save(basePath)
getQuickstartWriteConfigs: sets the shuffle parallelism (number of partitions) used when writing or updating data to Hudi
PRECOMBINE_FIELD_OPT_KEY: the field that decides which record is kept when records with the same primary key are merged
RECORDKEY_FIELD_OPT_KEY: the unique id of each record; multiple fields are supported
PARTITIONPATH_FIELD_OPT_KEY: the field used to partition the stored data
In spark-shell, enter paste mode with :paste, paste the code, then press Ctrl + D to execute it.
Hudi table data is stored on HDFS in the columnar Parquet format.
Reading data from a Hudi table likewise uses Spark SQL's external data source loading: specify the format and the relevant options:
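To make the roles of the record key and the precombine field concrete, here is a plain-Scala sketch of the idea (not Hudi's implementation). It assumes the common behaviour that, among incoming records sharing the same key, the one with the largest precombine value is kept. `Trip` and `preCombine` are hypothetical names introduced only for this illustration.

```scala
// Illustration only: Trip and preCombine are hypothetical, not Hudi APIs.
final case class Trip(uuid: String, ts: Long, fare: Double)

// For each record key (uuid), keep only the record with the largest ts,
// mirroring the role of the record key and precombine fields.
def preCombine(records: Seq[Trip]): Seq[Trip] =
  records
    .groupBy(_.uuid)
    .values
    .map(_.maxBy(_.ts))
    .toSeq
    .sortBy(_.uuid)

val incoming = Seq(
  Trip("a", ts = 1L, fare = 10.0),
  Trip("a", ts = 3L, fare = 12.5), // same key as above, newer ts
  Trip("b", ts = 2L, fare = 7.0)
)

val deduplicated = preCombine(incoming)
deduplicated.foreach(println)
```

In the write call above, `uuid` plays the role of the key and `ts` the role of the ordering field.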
val tripSnapshotDF = spark.read.format("hudi").load(basePath + "/*/*/*/*")
This specifies the storage path of the Hudi table data, matched with a glob (wildcard) pattern. The saved Hudi table is a partitioned table with three partition levels (equivalent to specifying three partition fields in a Hive table), so the expression /*/*/*/* is used to load all the data: three levels of partition directories plus the data files.
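The depth of the wildcard pattern can be checked with a small plain-Scala sketch. It uses the JDK's `java.nio` glob matcher purely as an illustration; Spark and Hadoop use their own glob implementation, and the relative paths below (partition values and file name) are hypothetical examples, not paths taken from the table.

```scala
import java.nio.file.{FileSystems, Paths}

// java.nio glob, used only to illustrate how many path levels */*/*/* covers.
val matcher = FileSystems.getDefault.getPathMatcher("glob:*/*/*/*")

// Hypothetical paths relative to basePath.
val dataFile   = Paths.get("americas/united_states/san_francisco/part-0001.parquet")
val tooShallow = Paths.get("americas/united_states/part-0001.parquet")

// Each * matches exactly one path segment and never crosses a directory boundary.
val deepMatch    = matcher.matches(dataFile)
val shallowMatch = matcher.matches(tooShallow)

println(s"three partition levels + file matches: $deepMatch")
println(s"two partition levels + file matches: $shallowMatch")
```

A table partitioned with a different number of levels would need a pattern with a matching number of wildcards.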
View table structure :
tripSnapshotDF.printSchema()
The data saved to the Hudi table has 5 extra fields (_hoodie_commit_time, _hoodie_commit_seqno, _hoodie_record_key, _hoodie_partition_path, and _hoodie_file_name). Hudi uses these fields when managing the data.
Register the DataFrame obtained from the Hudi table as a temporary view, then use SQL to query and analyze the data according to business needs:
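One way to list just these extra fields, sketched under the assumption that `tripSnapshotDF` from the read step is still in scope and that the metadata columns share the `_hoodie_` prefix:

```scala
// Assumes tripSnapshotDF from the read step is in scope in the same spark-shell session.
// Keep only the columns whose names start with the _hoodie_ prefix and print them.
val hoodieMetaColumns = tripSnapshotDF.columns.filter(_.startsWith("_hoodie_"))
hoodieMetaColumns.foreach(println)
```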
tripSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
Query with Spark SQL:
spark.sql("select fare, begin_lat, begin_lon, ts from hudi_trips_snapshot where fare > 20.0").show()
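The same query can also be written with the DataFrame API instead of a SQL string. This is a sketch that assumes `tripSnapshotDF` is still in scope; because the chained lines start with a dot, it is best run in paste mode.

```scala
import org.apache.spark.sql.functions.col

// Assumes tripSnapshotDF from the read step is in scope in the same spark-shell session.
// Equivalent to: select fare, begin_lat, begin_lon, ts ... where fare > 20.0
tripSnapshotDF
  .filter(col("fare") > 20.0)
  .select("fare", "begin_lat", "begin_lon", "ts")
  .show()
```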
Check the newly added fields :
spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, _hoodie_file_name from hudi_trips_snapshot").show()
These are the fields that Hudi adds for table management.