
Data Lake Hudi (14): Basic Concepts of Apache Hudi

2022-06-12 03:15:00 Electro optic scintillation

Contents

0. Links to related articles

1. Overview

2. The timeline (Timeline)

3. File management

4. Indexes (Index)

5. Data storage management in Hudi


0. Links to related articles

A summary of articles on big data fundamentals

1. Overview

        Hudi provides the concept of a Hudi table. These tables support CRUD operations, can store their data files on an existing big data cluster such as HDFS, and can then be analyzed and queried with engines such as Spark SQL or Hive. A Hudi table has three main components:
1) An ordered timeline of metadata, similar to a database transaction log.
2) Hierarchically laid-out data files: the data actually written into the table.
3) An index (with multiple implementations) that maps a given record to the data files containing it.

2. The timeline (Timeline)

  • The core of Hudi: every table maintains a timeline (Timeline) of the operations performed on the dataset (for example inserts, updates, or deletes) at different instants (Instant).

  • Every operation on a Hudi table generates a new Instant on that table's Timeline. This makes it possible to query only the data successfully committed after a certain point in time, or only the data before a certain point in time, effectively avoiding scans over a larger time range.

  • At the same time, files can be queried efficiently as they were before a change (after an Instant commits a change, a query for data before that point in time still sees the unmodified data).
  • The Timeline is Hudi's abstraction for managing commits: every commit is bound to a fixed timestamp and laid out along the timeline.
  • On the Timeline, every commit is abstracted as a HoodieInstant; one instant records a commit's action, timestamp, and state.

  • As an example, with time (by the hour) as the partition field: commits of various kinds are produced starting at 10:00, and at 10:20 a record with event time 9:00 arrives. That record still lands in the partition for 9:00, and by consuming the incremental updates after 10:00 directly from the timeline (consuming only the file groups that have new commits), this late-arriving data can still be consumed.
  • The Timeline implementation classes (located in hudi-common-xx.jar) live in the org.apache.hudi.common.table.timeline package.
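The incremental-consumption idea above can be sketched in plain Python. `HoodieInstant` below is a simplified stand-in for the real class in org.apache.hudi.common.table.timeline; the fields, timestamp values, and the helper function are illustrative assumptions, not Hudi's actual API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HoodieInstant:
    """Simplified stand-in: one instant records an action, timestamp, and state."""
    action: str      # e.g. "commit", "deltacommit", "clean"
    timestamp: str   # Hudi-style yyyyMMddHHmmss strings sort lexically by time
    state: str       # "REQUESTED", "INFLIGHT", or "COMPLETED"

def completed_commits_after(timeline, begin_ts):
    """Completed commits strictly after begin_ts: the basis of an incremental
    query -- only data committed after the given point in time is consumed."""
    return [i for i in timeline
            if i.state == "COMPLETED" and i.action == "commit"
            and i.timestamp > begin_ts]

timeline = [
    HoodieInstant("commit", "20220612090000", "COMPLETED"),
    HoodieInstant("commit", "20220612100000", "COMPLETED"),
    HoodieInstant("commit", "20220612102000", "COMPLETED"),  # late-arriving 9:00 data
    HoodieInstant("commit", "20220612103000", "INFLIGHT"),   # not yet visible
]

# Consuming incrementally from 10:00 picks up the 10:20 commit,
# so the late-arriving 9:00 data is still consumed.
new = completed_commits_after(timeline, "20220612100000")
```

Note that the in-flight instant is excluded: only successfully committed data is visible to the incremental reader.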

3. File management

  • Hudi organizes a dataset on DFS into a directory structure under a base path (HoodieWriteConfig.BASE_PATH_PROP).
  • The dataset is divided into multiple partitions (DataSourceOptions.PARTITIONPATH_FIELD_OPT_KEY). Much like Hive tables, each partition is a folder containing that partition's data files.

  • Within each partition, files are organized into file groups, uniquely identified by a file ID. Each file group contains multiple file slices; each slice contains a base columnar file (.parquet) produced by a commit/compaction at some instant, together with a set of log files (.log) containing the inserts/updates applied to the base file since it was generated.
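To illustrate the layout above, the following sketch groups base-file names into file groups and file slices. The `<fileId>_<writeToken>_<instantTime>.parquet` naming pattern and all file names here are illustrative assumptions about Hudi's base-file naming, not output from a real table.

```python
import re
from collections import defaultdict

# Assumed naming pattern: <fileId>_<writeToken>_<instantTime>.parquet
BASE_FILE_RE = re.compile(
    r"^(?P<file_id>[A-Za-z0-9-]+)_(?P<token>\d+-\d+-\d+)_(?P<instant>\d+)\.parquet$"
)

def group_into_file_groups(file_names):
    """Map file_id -> {instant -> base file}: a file group is identified by
    its file ID and holds one file slice per commit/compaction instant."""
    groups = defaultdict(dict)
    for name in file_names:
        m = BASE_FILE_RE.match(name)
        if m:
            groups[m.group("file_id")][m.group("instant")] = name
    return dict(groups)

files = [
    "fg-001_0-1-2_20220612100000.parquet",
    "fg-001_0-3-4_20220612110000.parquet",  # newer slice, same file group
    "fg-002_0-1-2_20220612100000.parquet",
]
groups = group_into_file_groups(files)
latest = max(groups["fg-001"])  # a snapshot read serves the latest slice
```

Because instant times sort lexically, picking the latest slice of a file group is a simple `max` over its instants.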

  • A Hudi base file (parquet file) records, in the meta of its footer, a BloomFilter built from the record keys; the file-based index implementation uses it to realize efficient key-containment tests.
  • A Hudi log (avro file) is self-describing: accumulated data buffers are written in units of LogBlocks, and every LogBlock contains a magic number, size, content, footer, and so on, supporting reading, validation, and filtering of the data.
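The key-containment test mentioned above can be sketched with a toy Bloom filter. The bit size, hash count, and hashing scheme below are illustrative only, not Hudi's actual footer encoding.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: answers 'definitely absent' or 'possibly present'."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit vector packed into one int

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.size

    def add(self, key):
        for p in self._positions(key):
            self.bits |= (1 << p)

    def might_contain(self, key):
        # False => the key is definitely not in the file, so it can be skipped;
        # True  => the key may be present and the file must actually be read.
        return all(self.bits & (1 << p) for p in self._positions(key))

bf = BloomFilter()
for k in ("uuid-1", "uuid-2"):
    bf.add(k)
```

This is why the index can prune files cheaply: a negative answer is exact (no false negatives), so only files whose filter answers "possibly present" need their record keys checked.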

4. Indexes (Index)

  • Hudi provides efficient upsert operations through its index mechanism, which maps a RecordKey + PartitionPath combination, as a unique identifier, to a file ID. The mapping between this unique identifier and the file group / file ID never changes once the record has been written to its file group.
    • Global index: keys must be unique across all partitions of the table, i.e. for a given key there is one and only one corresponding record.
    • Non-global index: keys are required to be unique only within one partition of the table; it relies on the writer to supply a consistent partition path when updating or deleting a given record.
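A minimal sketch of the index contract just described. `RecordIndex` and `tag_location` are hypothetical names invented for this sketch (Hudi's real index implementations are JVM-side); the point is only the difference between global and non-global key uniqueness, and the immutability of the record-to-file-ID binding.

```python
class RecordIndex:
    """Sketch: maps a record's identity to a file ID, immutably after first write."""

    def __init__(self, is_global):
        self.is_global = is_global
        self.mapping = {}

    def _identity(self, record_key, partition_path):
        # A global index keys on record_key alone (unique table-wide);
        # a non-global index keys on (partition_path, record_key).
        return record_key if self.is_global else (partition_path, record_key)

    def tag_location(self, record_key, partition_path, file_id):
        """Return the already-bound file ID (upsert -> update), or bind the
        record to file_id on first write (upsert -> insert)."""
        k = self._identity(record_key, partition_path)
        if k not in self.mapping:
            self.mapping[k] = file_id
        return self.mapping[k]

idx_global = RecordIndex(is_global=True)
idx_local = RecordIndex(is_global=False)

# The same key written under two different partition paths:
a = idx_global.tag_location("k1", "2022-06-11", "fg-1")
b = idx_global.tag_location("k1", "2022-06-12", "fg-2")  # still fg-1: one record per key table-wide
c = idx_local.tag_location("k1", "2022-06-11", "fg-1")
d = idx_local.tag_location("k1", "2022-06-12", "fg-2")   # fg-2: unique only within a partition
```

The non-global case shows why the writer must supply a consistent partition path: writing the same key under a different partition creates a second, independent record.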

5. Data storage management in Hudi


Note: The posts in this Hudi series are study notes written while working through the Hudi official website, with some personal understanding added. Please bear with any shortcomings.

Note: Links to other related articles (including Hudi and other big data posts) are collected here -> A summary of articles on big data fundamentals



Copyright notice: This article was written by [Electro optic scintillation]. Please include the original link when reposting:
https://yzsam.com/2022/03/202203011100412307.html