Hudi key generation
2022-06-12 12:55:00 【From big data to artificial intelligence】
Every record in Hudi is uniquely identified by a primary key, which is the pair of the record key and the partition path the record belongs to. Using primary keys, Hudi can (a) enforce a partition-level uniqueness integrity constraint and (b) update and delete records quickly. Choose the partitioning scheme wisely, because it can be a determining factor for ingestion and query latency.
In general, Hudi supports both partitioned and global indexes. For a dataset with a partitioned index (the most common case), each record is uniquely identified by the pair of record key and partition path. For a dataset with a global index, each record is uniquely identified by the record key alone, so no record key is duplicated across partitions.
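As a rough illustration of the difference (plain Python, not Hudi code; the helper name `identity` and the sample values are made up for this sketch), the identity of a record under each index type can be modeled like this:

```python
def identity(record_key, partition_path, global_index=False):
    """Return the value that uniquely identifies a record under each index type."""
    if global_index:
        # Global index: the record key alone must be unique across the table.
        return record_key
    # Partitioned index: uniqueness holds only within a partition.
    return (partition_path, record_key)


a = ("uuid-1", "2022/06/11")
b = ("uuid-1", "2022/06/12")  # same record key, different partition

# Two distinct records under a partitioned index...
distinct_partitioned = identity(*a) != identity(*b)
# ...but one and the same record under a global index.
same_global = identity(*a, global_index=True) == identity(*b, global_index=True)
```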
Key generator
Hudi provides several key generators out of the box that users can pick from according to their needs, and it also offers a pluggable mechanism so that users can implement and use their own KeyGenerator. This page walks through all the key generators that are readily available.
The KeyGenerator interface in Hudi is the reference point for all of them.
Before diving into the different types of key generators, let's go over the common configs that key generators need.
| Config | Meaning/purpose |
|---|---|
| hoodie.datasource.write.recordkey.field | Refers to the record key field. This is a mandatory field. |
| hoodie.datasource.write.partitionpath.field | Refers to the partition path field. This is a mandatory field. |
| hoodie.datasource.write.keygenerator.class | Refers to the key generator class (including the full path). Can refer to any of the available ones or a user-defined one. This is a mandatory field. |
| hoodie.datasource.write.partitionpath.urlencode | When set to true, the partition path is URL encoded. Default value is false. |
| hoodie.datasource.write.hive_style_partitioning | When set to true, uses hive-style partitioning. The partition field name is prefixed to the value. Format: "\<partition_path_field_name>=\<partition_path_value>". Default value is false. |
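To make the table concrete, here is a minimal sketch of these options collected in a Python dict, the form typically handed to a Spark writer. The table name and the column names `uuid` and `country` are hypothetical, and the write call is shown only as a comment because it assumes a Spark session with the Hudi bundle available.

```python
hudi_options = {
    "hoodie.table.name": "trips",  # hypothetical table name
    # Mandatory: which column supplies the record key (hypothetical column).
    "hoodie.datasource.write.recordkey.field": "uuid",
    # Mandatory: which column supplies the partition path (hypothetical column).
    "hoodie.datasource.write.partitionpath.field": "country",
    # Mandatory: fully qualified key generator class.
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.SimpleKeyGenerator",
    # Optional, both default to false.
    "hoodie.datasource.write.partitionpath.urlencode": "false",
    "hoodie.datasource.write.hive_style_partitioning": "true",
}

# With a Spark DataFrame `df` and a target path `base_path` (both assumed),
# the options would typically be applied like this (not executed here):
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```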
If you are using TimestampBasedKeyGenerator, a few more configs are required. They are covered in the respective section.
Let's go over the different key generators available in Hudi.
SimpleKeyGenerator
The record key refers to one field (a DataFrame column) by name, and the partition path refers to one field (a single DataFrame column) by name. This is one of the most commonly used generators. Values are read from the DataFrame and converted to strings.
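The behavior described for SimpleKeyGenerator can be sketched in a few lines of plain Python. This is an illustration only, not Hudi's implementation; the function name `simple_key`, the `hive_style` flag and the sample row are assumptions made for the example.

```python
def simple_key(row, record_key_field, partition_path_field, hive_style=False):
    """Build (record_key, partition_path) from one column each, as strings."""
    record_key = str(row[record_key_field])
    partition_path = str(row[partition_path_field])
    if hive_style:
        # Hive-style partitioning prefixes the field name to the value.
        partition_path = f"{partition_path_field}={partition_path}"
    return record_key, partition_path


row = {"uuid": 42, "country": "IN", "amount": 9.5}  # hypothetical record
plain = simple_key(row, "uuid", "country")
hive = simple_key(row, "uuid", "country", hive_style=True)
```

Here `plain` is `("42", "IN")` and `hive` is `("42", "country=IN")`; note that the integer key has been converted to a string.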
ComplexKeyGenerator
Both the record key and the partition path consist of one or more fields referenced by name (a combination of multiple fields). Fields are comma-separated in the config value. For example: "hoodie.datasource.write.recordkey.field": "col1,col4"
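A small Python sketch of the multi-field case follows. The record key uses the field1:value1,field2:value2 layout that this article gives for multi-field record keys; joining the partition values with "/" is an assumption of this sketch, as are the function name `complex_key` and the sample row.

```python
def complex_key(row, record_key_fields, partition_path_fields):
    """Build (record_key, partition_path) from comma-separated field lists."""
    key_names = [name.strip() for name in record_key_fields.split(",")]
    part_names = [name.strip() for name in partition_path_fields.split(",")]
    # Record key: name:value pairs joined by commas.
    record_key = ",".join(f"{name}:{row[name]}" for name in key_names)
    # Partition path: one path segment per field (assumed separator "/").
    partition_path = "/".join(str(row[name]) for name in part_names)
    return record_key, partition_path


row = {"col1": "a1", "col4": 7, "region": "emea", "year": 2022}  # hypothetical record
key, path = complex_key(row, "col1,col4", "region,year")
```

With this row, `key` is `"col1:a1,col4:7"` and `path` is `"emea/2022"`.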
GlobalDeleteKeyGenerator
Deletes against a global index do not require partition values. This key generator therefore avoids using the partition value when generating the HoodieKey.
NonpartitionedKeyGenerator
If your Hudi dataset is not partitioned, you can use NonpartitionedKeyGenerator, which returns an empty partition for all records. In other words, all records go to the same partition (the empty string "").
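In sketch form (plain Python with a hypothetical helper name, not Hudi code), the non-partitioned case simply pairs every record key with an empty partition path:

```python
def nonpartitioned_key(row, record_key_field):
    """Return (record_key, partition_path) where the partition is always empty."""
    return str(row[record_key_field]), ""


rows = [{"uuid": "a"}, {"uuid": "b"}, {"uuid": "c"}]  # hypothetical records
partitions = {nonpartitioned_key(r, "uuid")[1] for r in rows}
```

`partitions` contains only the empty string, so all records land in the same (empty) partition.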
CustomKeyGenerator
This is a generic KeyGenerator implementation that combines the benefits of SimpleKeyGenerator, ComplexKeyGenerator and TimestampBasedKeyGenerator. The record key and the partition path can each be configured as a single field or as a combination of multiple fields.
hoodie.datasource.write.recordkey.field
hoodie.datasource.write.partitionpath.field
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.CustomKeyGenerator
This key generator is particularly useful if you want to define complex partition paths that mix regular fields and timestamp-based fields. It expects the value of "hoodie.datasource.write.partitionpath.field" in a specific format: "field1:PartitionKeyType1,field2:PartitionKeyType2…"
The complete partition path is created as \<value for field1 based on PartitionKeyType1>/\<value for field2 based on PartitionKeyType2> and so on. Each partition key type can be either SIMPLE or TIMESTAMP.
Example config value: "field_3:simple,field_5:timestamp"
The record key config value is either a single field, as in SimpleKeyGenerator, or comma-separated field names, as in ComplexKeyGenerator. Example:
hoodie.datasource.write.recordkey.field=field1,field2
This creates the record key in the format field1:value1,field2:value2 and so on; for a simple record key, specify only one field. The CustomKeyGenerator class defines an enum PartitionKeyType for configuring partition paths. It can take two possible values: SIMPLE and TIMESTAMP. For partitioned tables, the value of the hoodie.datasource.write.partitionpath.field property must be given in the format field1:PartitionKeyType1,field2:PartitionKeyType2 and so on. For example, if you want to build the partition path from two fields, country and date, where the latter holds timestamp-based values that need to be rendered in a given format, you can specify the following:
hoodie.datasource.write.partitionpath.field=country:SIMPLE,date:TIMESTAMP
This creates the partition path in the format \<country_name>/\<date> or country=\<country_name>/date=\<date>, depending on whether you want hive-style partitioning.
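The partition path spec can be pictured with the following Python sketch. It is not Hudi's implementation: the function name `custom_partition_path` is made up, TIMESTAMP fields are assumed to hold epoch seconds interpreted in UTC, and the output format is passed as a Python `strftime` pattern instead of the timestamp configs Hudi actually reads.

```python
from datetime import datetime, timezone


def custom_partition_path(row, spec, output_format="%Y-%m-%d", hive_style=False):
    """Build a partition path from a 'field:TYPE,field:TYPE' spec."""
    segments = []
    for item in spec.split(","):
        field, key_type = (part.strip() for part in item.split(":"))
        key_type = key_type.upper()  # the sample config uses lower case too
        if key_type == "SIMPLE":
            value = str(row[field])
        elif key_type == "TIMESTAMP":
            # Assumption: the field holds epoch seconds, rendered in UTC.
            moment = datetime.fromtimestamp(row[field], tz=timezone.utc)
            value = moment.strftime(output_format)
        else:
            raise ValueError(f"unknown partition key type: {key_type}")
        segments.append(f"{field}={value}" if hive_style else value)
    return "/".join(segments)


row = {"country": "IN", "date": 1578283932}  # hypothetical record
plain = custom_partition_path(row, "country:SIMPLE,date:TIMESTAMP")
hive = custom_partition_path(row, "country:SIMPLE,date:TIMESTAMP", hive_style=True)
```

For this row, `plain` is `"IN/2020-01-06"` and `hive` is `"country=IN/date=2020-01-06"`, mirroring the two layouts described above.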
Implementing your own key generator
You can implement your own custom key generator by extending Hudi's public key generator API class.
TimestampBasedKeyGenerator
This key generator relies on timestamps for the partition field. When generating the partition path value for a record, the field value is interpreted as a timestamp rather than simply converted to a string. The record key is still selected by field name, as before. Users need to set a few more configs to use this KeyGenerator.
Configuration settings :
| Config | Meaning/purpose |
|---|---|
| hoodie.deltastreamer.keygen.timebased.timestamp.type | One of the supported timestamp types (UNIX_TIMESTAMP, DATE_STRING, MIXED, EPOCHMILLISECONDS, SCALAR) |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | Output date format |
| hoodie.deltastreamer.keygen.timebased.timezone | Timezone of the data format |
| hoodie.deltastreamer.keygen.timebased.input.dateformat | Input date format |
Some examples for TimestampBasedKeyGenerator
Timestamp is GMT
| Config field | Value |
|---|---|
| hoodie.deltastreamer.keygen.timebased.timestamp.type | "EPOCHMILLISECONDS" |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyy-MM-dd hh" |
| hoodie.deltastreamer.keygen.timebased.timezone | "GMT+8:00" |
Input field value: "1578283932000L"
Partition path generated by the key generator: "2020-01-06 12"
If the input field value is null for some rows, the partition path generated by the key generator is "1970-01-01 08".
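The arithmetic behind this example can be checked with plain Python. This is a sketch, not Hudi code: the function name is made up, the Java pattern "yyyy-MM-dd hh" is approximated with the `strftime` pattern "%Y-%m-%d %I" (both use the 12-hour clock for the hour), and a null input is treated as epoch 0, which is consistent with the null result shown above.

```python
from datetime import datetime, timedelta, timezone

GMT_PLUS_8 = timezone(timedelta(hours=8))


def epoch_millis_partition(value_ms, tz=GMT_PLUS_8):
    """Render epoch milliseconds as 'yyyy-MM-dd hh' in the given timezone."""
    # Assumption: a missing value falls back to the epoch (0 ms).
    millis = 0 if value_ms is None else value_ms
    moment = datetime.fromtimestamp(millis / 1000, tz=tz)
    # %I is the 12-hour clock hour, like Java's "hh".
    return moment.strftime("%Y-%m-%d %I")


from_value = epoch_millis_partition(1578283932000)
from_null = epoch_millis_partition(None)
```

1578283932000 ms is 04:12:12 UTC on 2020-01-06, which is 12:12:12 in GMT+8, so `from_value` is `"2020-01-06 12"`; the epoch itself is 08:00 in GMT+8, so `from_null` is `"1970-01-01 08"`.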
Timestamp is DATE_STRING
| Config field | Value |
|---|---|
| hoodie.deltastreamer.keygen.timebased.timestamp.type | "DATE_STRING" |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyy-MM-dd hh" |
| hoodie.deltastreamer.keygen.timebased.timezone | "GMT+8:00" |
| hoodie.deltastreamer.keygen.timebased.input.dateformat | "yyyy-MM-dd hh:mm:ss" |
Input field value: "2020-01-06 12:12:12"
Partition path generated by the key generator: "2020-01-06 12"
If the input field value is null for some rows, the partition path generated by the key generator is "1970-01-01 12:00:00".
Scalar example
| Config field | Value |
|---|---|
| hoodie.deltastreamer.keygen.timebased.timestamp.type | "SCALAR" |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyy-MM-dd hh" |
| hoodie.deltastreamer.keygen.timebased.timezone | "GMT" |
| hoodie.deltastreamer.keygen.timebased.timestamp.scalar.time.unit | "days" |
Input field value: "20000L"
Partition path generated by the key generator: "2024-10-04 12"
If the input field value is null, the partition path generated by the key generator is "1970-01-02 12".
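The non-null scalar result can be reproduced in plain Python. This is a sketch under stated assumptions: the scalar is counted in days since 1970-01-01 GMT, the function name is hypothetical, and "yyyy-MM-dd hh" is approximated with "%Y-%m-%d %I". The null case is left out because the source does not explain how that value is derived.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def scalar_days_partition(value_days):
    """Render a day count since the epoch as 'yyyy-MM-dd hh' in GMT."""
    moment = EPOCH + timedelta(days=value_days)
    # Midnight shows up as hour 12 because %I (like Java's "hh") is 12-hour.
    return moment.strftime("%Y-%m-%d %I")


partition = scalar_days_partition(20000)
```

20000 days after the epoch is midnight on 2024-10-04, and midnight prints as 12 on a 12-hour clock, which gives `"2024-10-04 12"`.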
ISO8601WithMsZ with a single input format
| Config field | Value |
|---|---|
| hoodie.deltastreamer.keygen.timebased.timestamp.type | "DATE_STRING" |
| hoodie.deltastreamer.keygen.timebased.input.dateformat | "yyyy-MM-dd'T'HH:mm:ss.SSSZ" |
| hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex | "" |
| hoodie.deltastreamer.keygen.timebased.input.timezone | "" |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyyMMddHH" |
| hoodie.deltastreamer.keygen.timebased.output.timezone | "GMT" |
Input field value: "2020-04-01T13:01:33.428Z"
Partition path generated by the key generator: "2020040113"
ISO8601WithMsZ with multiple input formats
| Config field | Value |
|---|---|
| hoodie.deltastreamer.keygen.timebased.timestamp.type | "DATE_STRING" |
| hoodie.deltastreamer.keygen.timebased.input.dateformat | "yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ" |
| hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex | "" |
| hoodie.deltastreamer.keygen.timebased.input.timezone | "" |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyyMMddHH" |
| hoodie.deltastreamer.keygen.timebased.output.timezone | "UTC" |
Input field value: "2020-04-01T13:01:33.428Z"
Partition path generated by the key generator: "2020040113"
ISO8601NoMs with offset, using multiple input formats
| Config field | Value |
|---|---|
| hoodie.deltastreamer.keygen.timebased.timestamp.type | "DATE_STRING" |
| hoodie.deltastreamer.keygen.timebased.input.dateformat | "yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ" |
| hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex | "" |
| hoodie.deltastreamer.keygen.timebased.input.timezone | "" |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | "yyyyMMddHH" |
| hoodie.deltastreamer.keygen.timebased.output.timezone | "UTC" |
Input field value: "2020-04-01T13:01:33-05:00"
Partition path generated by the key generator: "2020040118"
Input as a short date string, with the output expected in a date format
| Config field | Value |
|---|---|
| hoodie.deltastreamer.keygen.timebased.timestamp.type | "DATE_STRING" |
| hoodie.deltastreamer.keygen.timebased.input.dateformat | "yyyy-MM-dd'T'HH:mm:ssZ,yyyy-MM-dd'T'HH:mm:ss.SSSZ,yyyyMMdd" |
| hoodie.deltastreamer.keygen.timebased.input.dateformat.list.delimiter.regex | "" |
| hoodie.deltastreamer.keygen.timebased.input.timezone | "UTC" |
| hoodie.deltastreamer.keygen.timebased.output.dateformat | "MM/dd/yyyy" |
| hoodie.deltastreamer.keygen.timebased.output.timezone | "UTC" |
Input field value: "20200401"
Partition path generated by the key generator: "04/01/2020"
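The DATE_STRING examples with multiple input formats follow a try-each-format pattern, which can be sketched in plain Python (3.7 or later, where `%z` accepts both "Z" and offsets such as "-05:00"). This is not Hudi's parser: the function name is hypothetical, and the `strptime` patterns are approximate equivalents of the Java-style patterns in the tables.

```python
from datetime import datetime, timezone

# Approximate Python equivalents of the configured input formats.
INPUT_FORMATS = [
    "%Y-%m-%dT%H:%M:%S%z",     # yyyy-MM-dd'T'HH:mm:ssZ
    "%Y-%m-%dT%H:%M:%S.%f%z",  # yyyy-MM-dd'T'HH:mm:ss.SSSZ
    "%Y%m%d",                  # yyyyMMdd
]


def date_string_partition(value, output_format,
                          input_tz=timezone.utc, output_tz=timezone.utc):
    """Parse with the first matching input format, then render in output_tz."""
    for fmt in INPUT_FORMATS:
        try:
            moment = datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format does not match, try the next one
        if moment.tzinfo is None:
            # No offset in the input: fall back to the input timezone.
            moment = moment.replace(tzinfo=input_tz)
        return moment.astimezone(output_tz).strftime(output_format)
    raise ValueError(f"no input format matches {value!r}")


with_ms = date_string_partition("2020-04-01T13:01:33.428Z", "%Y%m%d%H")
with_offset = date_string_partition("2020-04-01T13:01:33-05:00", "%Y%m%d%H")
short_date = date_string_partition("20200401", "%m/%d/%Y")
```

The three results are `"2020040113"`, `"2020040118"` (13:01 at offset -05:00 is 18:01 UTC) and `"04/01/2020"`, matching the partition paths listed in the examples.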
This is an original article by 「xiaozhch5」 of the blog "From big data to artificial intelligence", published under the CC 4.0 BY-SA license. When reprinting, please include the original source link and this statement.
Original link: https://lrting.top/backend/bigdata/hudi/hudi-basic/5893/