Data Lake Hudi (14): Basic Concepts of Apache Hudi
2022-06-12 03:15:00 【Electro optic scintillation】
Catalog
1. Executive summary
2. Timeline
3. File management
4. Index
5. Hudi data storage management
0. Links to related articles
1. Executive summary
Hudi provides the concept of a Hudi table. These tables support CRUD operations; the data files can be stored on an existing big data cluster such as HDFS, and the data can then be analyzed and queried with engines such as Spark SQL or Hive. A Hudi table has three main components: 1) an ordered timeline of metadata, similar to a database transaction log; 2) hierarchical data files: the data actually written into the table; 3) an index (with multiple implementations) that maps a given record key to the dataset containing that record.

2. Timeline
- The core of Hudi: every table maintains a timeline (Timeline) of the operations (such as inserts, updates, or deletes) performed on the dataset at different instants (Instant).

- Every operation on a Hudi table generates an Instant on that table's Timeline. This makes it possible to query only the data successfully committed up to a certain point in time, or only the data written before a certain point, effectively avoiding scans over a larger time range.

- At the same time, it allows efficient queries against the files as they existed before a change (after an Instant commits a change, a query restricted to an earlier point in time still sees the pre-modification data).
- The Timeline is Hudi's abstraction for managing commits: every commit is bound to a fixed timestamp and laid out on the timeline.
- On the Timeline, each commit is abstracted as a HoodieInstant. An instant records a commit action, its timestamp, and its state.

- In the figure above, time (by the hour) is used as the partition field. All kinds of commits are produced starting from 10:00. At 10:20, a record with event time 9:00 arrives; it still lands in the partition corresponding to 9:00. By consuming the incremental updates after 10:00 directly through the timeline (consuming only the file groups with new commits), this late-arriving data can still be consumed.
- The implementation classes of the timeline (Timeline) are located in hudi-common-xx.jar, under the org.apache.hudi.common.table.timeline package.
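The timeline behavior described above can be sketched in a few lines of Python. This is a conceptual model only: the `HoodieInstant` fields follow the description above, but the `Timeline` class and its method names are hypothetical, not Hudi's real API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HoodieInstant:
    action: str      # the commit action, e.g. "commit", "deltacommit"
    timestamp: str   # commit time, e.g. "20220612100000"
    state: str       # "REQUESTED", "INFLIGHT", or "COMPLETED"

class Timeline:
    """Toy timeline: instants kept in timestamp order."""
    def __init__(self):
        self.instants = []

    def add(self, instant: HoodieInstant):
        self.instants.append(instant)
        self.instants.sort(key=lambda i: i.timestamp)

    def completed_before(self, ts: str):
        # only data committed successfully up to a point in time
        return [i for i in self.instants
                if i.state == "COMPLETED" and i.timestamp <= ts]

    def incremental_after(self, ts: str):
        # incremental consumption: only new commits after a point in time
        return [i for i in self.instants
                if i.state == "COMPLETED" and i.timestamp > ts]

timeline = Timeline()
timeline.add(HoodieInstant("commit", "20220612100000", "COMPLETED"))
timeline.add(HoodieInstant("commit", "20220612101000", "COMPLETED"))
timeline.add(HoodieInstant("commit", "20220612102000", "INFLIGHT"))

# A query as of 10:00 sees one commit; the in-flight one is invisible.
print(len(timeline.completed_before("20220612100000")))  # 1
print(len(timeline.incremental_after("20220612100000")))  # 1
```

Note how the in-flight instant is excluded from both queries: only completed commits are visible, which is what avoids scanning a larger time range.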

3. File management
- Hudi organizes a dataset on DFS into a directory structure under a base path (HoodieWriteConfig.BASE_PATH_PROP).
- The dataset is divided into multiple partitions (DataSourceOptions.PARTITIONPATH_FIELD_OPT_KEY). These partitions, much like those of a Hive table, are folders containing that partition's data files.

- Within each partition, files are organized into file groups, each uniquely identified by a file id. Each file group contains multiple file slices. Each slice contains a base columnar file (.parquet) produced by a commit/compaction at a certain instant, together with a set of log files (.log) containing the inserts/updates made to the base file since it was generated.
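The partition / file group / file slice hierarchy can be sketched as follows. This is an illustrative model under assumed names: `FileGroup`, `FileSlice`, and the file naming shown are hypothetical, not Hudi's actual classes or layout.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FileSlice:
    base_instant: str    # the instant that produced the base file
    base_file: str       # base columnar file (.parquet)
    log_files: List[str] = field(default_factory=list)  # deltas since base

@dataclass
class FileGroup:
    file_id: str                 # unique identifier within the partition
    slices: List[FileSlice] = field(default_factory=list)

    def latest_slice(self) -> FileSlice:
        # a reader typically merges the newest base file with its log files
        return max(self.slices, key=lambda s: s.base_instant)

fg = FileGroup("fg-001")
fg.slices.append(FileSlice("20220612090000", "fg-001_20220612090000.parquet"))
fg.slices.append(FileSlice("20220612100000", "fg-001_20220612100000.parquet",
                           ["fg-001_20220612100000.log.1"]))
print(fg.latest_slice().base_instant)  # 20220612100000
```

Each compaction produces a new slice, so older slices remain available for queries as of an earlier instant, matching the timeline semantics in section 2.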

- In the footer metadata of Hudi's base files (parquet files), a BloomFilter built from the record keys is recorded; file-based index implementations use it to efficiently test whether a key is contained in the file.
- Hudi's log files (avro files) are self-describing: buffered data is written in units of LogBlocks. Each LogBlock contains a magic number, size, content, footer, etc., which are used for reading, validating, and filtering the data.
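The idea of a self-describing log block can be illustrated with a toy encoder/decoder. This assumes a simplified layout of magic number + size + content; the real HoodieLogBlock format has more fields (e.g. the footer) and differs in detail.

```python
import struct

MAGIC = b"#HUDI#"  # hypothetical magic number for this sketch

def encode_block(content: bytes) -> bytes:
    # the magic number lets a reader detect corrupt regions;
    # the size lets it skip block by block without parsing content
    return MAGIC + struct.pack(">I", len(content)) + content

def decode_block(buf: bytes, offset: int = 0):
    """Return (content, offset of the next block)."""
    if buf[offset:offset + len(MAGIC)] != MAGIC:
        raise ValueError("corrupt block: bad magic number")
    (size,) = struct.unpack_from(">I", buf, offset + len(MAGIC))
    start = offset + len(MAGIC) + 4
    return buf[start:start + size], start + size

buf = encode_block(b"insert r1") + encode_block(b"update r1")
content, next_off = decode_block(buf)
print(content)                       # b'insert r1'
print(decode_block(buf, next_off)[0])  # b'update r1'
```

Because each block carries its own size, a reader can scan forward past blocks it does not need, which is what makes per-block filtering cheap.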

4. Index
- Hudi supports efficient upsert operations through its index mechanism. The index maps the combination of RecordKey + PartitionPath, as a unique identifier, to a file id, and the mapping between this identifier and the file group / file id never changes once the record has been written to a file group.
- Global index: keys must be unique across all partitions of the table, i.e. a given key is guaranteed to map to one and only one record.
- Non-global index: keys only need to be unique within a single partition of the table; it relies on the writer to provide a consistent partition path when updating or deleting a given record.
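The difference between the two index types above comes down to how a record's identity is formed. The sketch below is a hypothetical, dictionary-based model (the `HudiIndex` class and its method names are assumptions, not Hudi's API), showing why a non-global index needs a consistent partition path.

```python
class HudiIndex:
    """Toy index: record identity -> file group id (stable once set)."""
    def __init__(self, is_global: bool):
        self.is_global = is_global
        self.mapping = {}

    def _identity(self, record_key: str, partition_path: str):
        # a global index identifies a record by key alone;
        # a non-global index scopes the key to its partition
        return record_key if self.is_global else (partition_path, record_key)

    def tag_location(self, record_key, partition_path):
        """File group holding this record, or None (treated as a new insert)."""
        return self.mapping.get(self._identity(record_key, partition_path))

    def record_location(self, record_key, partition_path, file_group_id):
        # setdefault: the mapping never changes once the record is written
        self.mapping.setdefault(self._identity(record_key, partition_path),
                                file_group_id)

idx = HudiIndex(is_global=False)
idx.record_location("uuid-1", "2022/06/12", "fg-001")
# upsert: the same key in the same partition routes to the existing file group
print(idx.tag_location("uuid-1", "2022/06/12"))  # fg-001
# a non-global index treats the same key in another partition as a new record
print(idx.tag_location("uuid-1", "2022/06/13"))  # None
```

With `is_global=True` the second lookup would return `fg-001` as well, which is exactly the table-wide uniqueness guarantee of a global index.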

5. Hudi data storage management

Note: this series of Hudi posts is a learning record written while studying the Hudi official website, with some personal understanding added; please bear with any shortcomings.
Note: links to other related articles (including Hudi and other big-data posts) are collected here -> A summary of basic big data knowledge points