Hudi quick experience (including detailed operation steps and screenshots)
2022-07-03 09:25:00 【Did Xiao Hu get stronger today】
Hudi Quick experience
This example walks through the following process:
Hadoop, Spark, and Hudi need to be installed in advance.
Spark installation tutorial:
https://blog.csdn.net/hshudoudou/article/details/125204028?spm=1001.2014.3001.5501
Hudi compilation and installation tutorial:
https://blog.csdn.net/hshudoudou/article/details/123881739?spm=1001.2014.3001.5501
Note that Hudi only manages data: it does not store the data itself (HDFS does), and it does not analyze it (Spark does).
Start spark-shell and add the jar packages
./spark-shell \
--master local[2] \
--jars /home/hty/hudi-jars/hudi-spark3-bundle_2.12-0.9.0.jar,\
/home/hty/hudi-jars/spark-avro_2.12-3.0.1.jar,/home/hty/hudi-jars/spark_unused-1.0.0.jar \
--conf "spark.serializer=org.apache.spark.serializer.KryoSerializer"
You can see the three jar packages are loaded successfully:
Import the packages and set the storage directory:
import org.apache.hudi.QuickstartUtils._
import scala.collection.JavaConversions._
import org.apache.spark.sql.SaveMode._
import org.apache.hudi.DataSourceReadOptions._
import org.apache.hudi.DataSourceWriteOptions._
import org.apache.hudi.config.HoodieWriteConfig._
val tableName = "hudi_trips_cow"
val basePath = "hdfs://hadoop102:8020/datas/hudi-warehouse/hudi_trips_cow"
val dataGen = new DataGenerator
2. Simulate the generation of Trip ride data
val inserts = convertToStringList(dataGen.generateInserts(10))
3. Convert the simulated List data into a DataFrame dataset
val df = spark.read.json(spark.sparkContext.parallelize(inserts, 2))
4. View the schema of the converted DataFrame:
df.printSchema()
5. Select the relevant fields and view the simulated sample data:
df.select("rider", "begin_lat", "begin_lon", "driver", "fare", "uuid", "ts").show(10, truncate=false)
Insert data
Save the simulated Trip data into a Hudi table. Because Hudi was born on top of the Spark framework, SparkSQL supports Hudi as a data source: simply specify the source via format, set the relevant properties, and save the data.
df.write
.mode(Overwrite)
.format("hudi")
.options(getQuickstartWriteConfigs)
.option(PRECOMBINE_FIELD_OPT_KEY, "ts")
.option(RECORDKEY_FIELD_OPT_KEY, "uuid")
.option(PARTITIONPATH_FIELD_OPT_KEY, "partitionpath")
.option(TABLE_NAME, tableName)
.save(basePath)
getQuickstartWriteConfigs: sets the number of shuffle partitions used when writing/updating data to Hudi
PRECOMBINE_FIELD_OPT_KEY: the field used to decide which record wins when records with the same key are merged
RECORDKEY_FIELD_OPT_KEY: the unique id of each record; multiple fields are supported
PARTITIONPATH_FIELD_OPT_KEY: the partition field used when storing data
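The interplay of the record key and the precombine field can be sketched in plain Scala, outside Spark: when two incoming records share the same record key (uuid), the one with the larger precombine value (ts) wins. This is a simplified illustration of the idea, not Hudi's actual implementation, and the sample records are hypothetical.

```scala
// Simplified sketch of Hudi's record-key + precombine semantics (not Hudi's real code).
// Each record: (record key uuid, precombine field ts, payload field fare)
case class Rec(uuid: String, ts: Long, fare: Double)

val incoming = Seq(
  Rec("a1", 100L, 19.0),
  Rec("a1", 200L, 27.7), // same key, newer ts: this one wins on merge
  Rec("b2", 150L, 33.9)
)

// Deduplicate by record key, keeping the record with the largest ts
val merged = incoming
  .groupBy(_.uuid)
  .map { case (_, recs) => recs.maxBy(_.ts) }
  .toSeq
  .sortBy(_.uuid)
```

After merging, only one record per uuid remains, which is why a unique record key and a sensible precombine field matter when writing.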
Use :paste mode in spark-shell; after pasting, press Ctrl + D to execute.
Hudi table data is stored on HDFS in the columnar PARQUET format.
To read data from the Hudi table, load it the same way SparkSQL loads any external data source: specify the format as the data source and set the related options:
val tripSnapshotDF = spark.read.format("hudi").load(basePath + "/*/*/*/*")
This specifies the storage path of the Hudi table data and uses glob (wildcard) matching. Since the saved Hudi table is a partitioned table with three partition levels (equivalent to specifying three partition fields in a Hive table), the expression /*/*/*/* loads all the data.
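The three levels correspond to the partition path the quickstart DataGenerator produces, e.g. region/country/city such as americas/united_states/san_francisco (sample values, assuming the standard quickstart generator). A quick pure-Scala check of why four wildcards are needed:

```scala
// The quickstart table is partitioned three levels deep, e.g. region/country/city.
// load(basePath + "/*/*/*/*"): three wildcards for the partition directories,
// a fourth for the parquet files inside the deepest directory.
val samplePartitionPaths = Seq(
  "americas/united_states/san_francisco", // sample values from the quickstart generator
  "asia/india/chennai"
)

// Every partition path has exactly three directory levels
val levels = samplePartitionPaths.map(_.split("/").length)
```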
View the table structure:
tripSnapshotDF.printSchema()
The data saved into the Hudi table gains 5 extra fields; these are the fields Hudi uses when managing the data.
Register the DataFrame of the Hudi table data as a temporary view, and use SQL to query and analyze the data according to business needs:
tripSnapshotDF.createOrReplaceTempView("hudi_trips_snapshot")
Query with Spark SQL:
spark.sql("select fare, begin_lat, begin_lon, ts from hudi_trips_snapshot where fare > 20.0").show()
Check the newly added fields:
spark.sql("select _hoodie_commit_time, _hoodie_record_key, _hoodie_partition_path, _hoodie_file_name from hudi_trips_snapshot").show()
These newly added fields are added by Hudi for table management.
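For reference, the five metadata columns Hudi prepends to every record are the four queried above plus _hoodie_commit_seqno:

```scala
// The five metadata columns Hudi adds to every record for table management
val hoodieMetaFields = Seq(
  "_hoodie_commit_time",    // commit timestamp of the write
  "_hoodie_commit_seqno",   // sequence number of the record within the commit
  "_hoodie_record_key",     // the record key (here: uuid)
  "_hoodie_partition_path", // partition the record was written to
  "_hoodie_file_name"       // data file holding the record
)
```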