Data Lake with Hudi (14): Basic Concepts of Apache Hudi
2022-06-12 03:15:00 【Electro optic scintillation】
0. Links to related articles
Big data fundamentals: article roundup
1. Overview
Hudi provides the concept of a Hudi table. These tables support CRUD operations, can store their data files on an existing big data cluster such as HDFS, and can then be queried and analyzed with engines such as SparkSQL or Hive. A Hudi table has three main components: 1) ordered timeline metadata, similar to a database transaction log; 2) hierarchically organized data files, which hold the data actually written to the table; 3) an index (with several implementations), which locates the data files that contain a specified record.

2. Timeline
- The core of Hudi: every table maintains a timeline of the operations (for example inserts, updates, or deletes) performed on the dataset at different instants.

- Every operation on a Hudi table's dataset generates an instant on that table's timeline. This makes it possible to query only the data committed successfully after a given point in time, or only the data from before a given point in time, which avoids scanning data over a larger time range.

- It also makes it possible to efficiently query only the files as they were before a change (for example, after an instant commits a change, a query for data before a certain point in time still returns the data as it was before the modification).
- The timeline is the abstraction Hudi uses to manage commits. Every commit is bound to a fixed timestamp and laid out along the timeline.
- On the timeline, every commit is abstracted as a HoodieInstant. An instant records the action of a commit, its timestamp, and its state.

- In the figure above, time (the hour) is used as the partition field. Commits of various kinds are produced starting at 10:00. At 10:20 a record belonging to 9:00 arrives late, and it still lands in the partition for 9:00. A consumer that reads the incremental updates after 10:00 directly through the timeline (consuming only the file groups that have new commits) therefore still picks up this late data.
- The implementation classes for the timeline are in the package org.apache.hudi.common.table.timeline, which ships in hudi-common-xx.jar.
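To make the timeline idea above more concrete, here is a minimal sketch in plain Python. It is a toy model, not Hudi's implementation: the `Instant` and `Timeline` classes, their method names, and the timestamp format are all hypothetical stand-ins for illustration. The three state names follow the states described in the Hudi documentation.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instant:
    """Toy stand-in for an instant: an action, a timestamp, and a state."""
    action: str     # e.g. "commit" or "compaction"
    timestamp: str  # sortable string; the format here is an assumption
    state: str      # "REQUESTED", "INFLIGHT" or "COMPLETED"


class Timeline:
    """Hypothetical in-memory timeline, kept sorted by timestamp."""

    def __init__(self) -> None:
        self._instants: List[Instant] = []

    def add(self, instant: Instant) -> None:
        self._instants.append(instant)
        self._instants.sort(key=lambda i: i.timestamp)

    def completed_after(self, ts: str) -> List[Instant]:
        # Incremental view: only commits that finished after ts.
        return [i for i in self._instants
                if i.state == "COMPLETED" and i.timestamp > ts]

    def completed_up_to(self, ts: str) -> List[Instant]:
        # Point-in-time view: only commits that finished at or before ts.
        return [i for i in self._instants
                if i.state == "COMPLETED" and i.timestamp <= ts]


timeline = Timeline()
timeline.add(Instant("commit", "20220612100000", "COMPLETED"))
timeline.add(Instant("commit", "20220612102000", "COMPLETED"))
timeline.add(Instant("commit", "20220612103000", "INFLIGHT"))

incremental = timeline.completed_after("20220612100000")
snapshot = timeline.completed_up_to("20220612100000")
```

In this sketch an incremental reader only needs to look at `incremental`, so it never touches data written by earlier commits, and the commit that is still in flight is invisible to both views.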

3. File management
- Hudi organizes datasets on DFS into a directory structure under a base path (HoodieWriteConfig.BASE_PATH_PROP).
- A dataset is divided into multiple partitions (DataSourceOptions.PARTITIONPATH_FIELD_OPT_KEY). As with Hive tables, each partition is a folder that contains the data files of that partition.

- Within each partition, files are organized into file groups, each uniquely identified by a file ID. Each file group contains multiple file slices. Each slice contains a base columnar file (.parquet) produced by a commit or compaction at some instant, together with a set of log files (.log) that hold the inserts and updates made to the base file since it was generated.
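The relationship between file groups, file slices, base files, and log files can be sketched roughly as follows. This is a toy in-memory model under simplifying assumptions: the `FileSlice` and `FileGroup` classes and their fields are hypothetical, records are plain dictionaries keyed by record key, and real Parquet and log file handling is left out entirely.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FileSlice:
    """Hypothetical slice: one base file plus the log blocks written after it."""
    instant_time: str
    base_records: Dict[str, dict]  # stands in for the .parquet base file
    log_blocks: List[Dict[str, dict]] = field(default_factory=list)  # stands in for .log files

    def merged_view(self) -> Dict[str, dict]:
        # Apply the logged inserts and updates on top of the base file, in write order.
        view = dict(self.base_records)
        for block in self.log_blocks:
            view.update(block)
        return view


@dataclass
class FileGroup:
    """Hypothetical file group: a file ID and the slices created over time."""
    file_id: str
    slices: List[FileSlice] = field(default_factory=list)

    def latest_slice(self) -> FileSlice:
        return max(self.slices, key=lambda s: s.instant_time)


slice_v1 = FileSlice("20220612100000", {"k1": {"v": 1}, "k2": {"v": 2}})
slice_v1.log_blocks.append({"k2": {"v": 20}})  # update to an existing key
slice_v1.log_blocks.append({"k3": {"v": 3}})   # insert of a new key

group = FileGroup("fg-0001", [slice_v1])
view = group.latest_slice().merged_view()
```

Reading the latest slice merges the base records with the logged changes, so `k2` shows its updated value and `k3` appears even though neither change touched the base file.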

- A Hudi base file (a Parquet file) records, in its footer metadata, a BloomFilter built from the record keys. The file-based index uses it to implement efficient key-containment checks.
- A Hudi log file (an Avro file) uses its own encoding. Buffered data is written in units of LogBlocks, and each LogBlock contains a magic number, size, content, footer, and other fields that are used to read, validate, and filter the data.
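The key-containment check mentioned above can be illustrated with a small generic Bloom filter. This is only a sketch of the general technique: the class, its parameters, the SHA-256 based hashing, and the file names are assumptions for illustration and do not reflect how Hudi actually builds or serializes its filter.

```python
import hashlib
from typing import Dict, Iterator, List


class BloomFilter:
    """Generic Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


def candidate_files(filters: Dict[str, BloomFilter], key: str) -> List[str]:
    """Return the files whose filter says the key may be present."""
    return [name for name, bf in filters.items() if bf.might_contain(key)]


# Hypothetical base files, each with a filter built from its record keys.
filters = {"base-a.parquet": BloomFilter(), "base-b.parquet": BloomFilter()}
filters["base-a.parquet"].add("user-1")
filters["base-b.parquet"].add("user-2")

hits = candidate_files(filters, "user-1")
```

A file whose filter answers "no" definitely does not hold the key and can be skipped. A "yes" only means the key may be there, so the file still has to be read to confirm it.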

4. Index
- Hudi provides efficient upserts through its indexing mechanism. The mechanism takes the combination of RecordKey + PartitionPath as a unique identifier and maps it to a file ID. Once a record has been written to a file group, the mapping between this identifier and the file group / file ID does not change.
- Global index: requires keys to be unique across all partitions of the table, that is, there is exactly one record for a given key.
- Non-global index: requires keys to be unique only within a single partition of the table. It relies on the writer to supply a consistent partition path when updating or deleting the same record.
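The difference between the two index scopes can be shown with a toy lookup table. The `NonGlobalIndex` and `GlobalIndex` classes and their `tag` method are hypothetical names for this sketch. The model is also simplified: it assumes a record never moves once it has been placed.

```python
from typing import Dict, Tuple


class NonGlobalIndex:
    """Toy index keyed by (partition path, record key)."""

    def __init__(self) -> None:
        self._locations: Dict[Tuple[str, str], str] = {}

    def tag(self, partition_path: str, record_key: str, new_file_id: str) -> str:
        # Reuse the known file ID, or assign new_file_id on first sight.
        return self._locations.setdefault((partition_path, record_key), new_file_id)


class GlobalIndex:
    """Toy index keyed by record key alone, across all partitions."""

    def __init__(self) -> None:
        self._locations: Dict[str, Tuple[str, str]] = {}

    def tag(self, partition_path: str, record_key: str, new_file_id: str) -> Tuple[str, str]:
        # Return the (partition path, file ID) where the key already lives, if any.
        return self._locations.setdefault(record_key, (partition_path, new_file_id))


non_global = NonGlobalIndex()
first = non_global.tag("2022/06/12/09", "user-1", "fg-A")
other_partition = non_global.tag("2022/06/12/10", "user-1", "fg-B")
repeat = non_global.tag("2022/06/12/09", "user-1", "fg-C")

global_index = GlobalIndex()
g_first = global_index.tag("2022/06/12/09", "user-1", "fg-A")
g_second = global_index.tag("2022/06/12/10", "user-1", "fg-B")
```

With the non-global index, the same key written under a different partition path is treated as a separate record, which is why the writer has to supply the partition path consistently. With the global index, the second write resolves to the location where the key was first stored.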

5. Hudi data storage management

Note: the posts in this Hudi series are study notes written from the Hudi official website, with some personal interpretation added. Please bear with any shortcomings.
Note: links to other related articles (Hudi and other big data posts) are here -> Big data fundamentals: article roundup