Hudi data management and storage overview
2022-07-03 09:25:00 【Did Xiao Hu get stronger today】
Data management
**How does Hudi manage data?**
Hudi organizes data in tables. The data in each table resembles a Hive partitioned table: records are divided into different directories according to the partition field, and each record carries a primary key (PrimaryKey) that identifies it uniquely.
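As a rough illustration of this layout, the sketch below writes a tiny DataFrame to a Hudi table from spark-shell (with the hudi-spark bundle on the classpath). The table name `demo_table`, the columns `uuid`/`region`/`ts`, and the HDFS path are all hypothetical, chosen only to show where the primary key and partition field are declared.

```scala
// Minimal sketch: hypothetical table, columns and path; run in spark-shell with the Hudi bundle.
import org.apache.spark.sql.SaveMode
import spark.implicits._

val df = Seq(
  ("id-001", "asia",     "2022-07-03 09:25:00", 100L),
  ("id-002", "americas", "2022-07-03 09:25:00", 200L)
).toDF("uuid", "region", "event_time", "ts")

df.write.format("hudi")
  .option("hoodie.table.name", "demo_table")                        // the Hudi table
  .option("hoodie.datasource.write.recordkey.field", "uuid")        // primary key (record key)
  .option("hoodie.datasource.write.partitionpath.field", "region")  // partition field -> one directory per value
  .option("hoodie.datasource.write.precombine.field", "ts")         // ordering field used to de-duplicate
  .mode(SaveMode.Overwrite)
  .save("hdfs:///user/hudi/demo_table")
```

After the write, each distinct `region` value becomes its own directory under the table path, matching the Hive-style partition layout described above.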
Hudi Data management
The data files of a Hudi table can be stored on an ordinary operating-system file system or on a distributed file system such as HDFS. For the sake of analytical performance and data reliability, HDFS is generally used. Taking HDFS storage as the example, the files of a Hudi table fall into two categories.
.hoodie
(1) .hoodie files: because CRUD operations are fragmented, every operation generates a file. As these small files accumulate they seriously degrade HDFS performance, so Hudi designed a file-merging mechanism. The .hoodie folder stores the log files related to these file-merge operations.
In Hudi, the series of CRUD operations performed over time is called the Timeline, and a single operation on the Timeline is called an Instant. An Instant consists of the following (illustrated in the sketch after this list):
- Instant Action: records whether the operation was a data commit (COMMITS), a file compaction (COMPACTION), or a file clean (CLEANS);
- Instant Time: the time at which the operation took place;
- State: the state of the operation, either requested (REQUESTED), in flight (INFLIGHT), or completed (COMPLETED).
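As a minimal sketch of how Instants appear on disk, the snippet below lists the Timeline files under the `.hoodie` directory of the hypothetical `demo_table` from the previous sketch; the file names in the comments are illustrative examples, not actual output.

```scala
// Minimal sketch: list the Timeline files of the hypothetical demo_table.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hoodieDir = new Path("hdfs:///user/hudi/demo_table/.hoodie")
val fs = hoodieDir.getFileSystem(new Configuration())

// Each Timeline file name encodes Instant Time + Action (+ State), e.g.:
//   20220703092500.commit            -> completed commit
//   20220703092500.commit.inflight   -> commit in progress
//   20220703092500.commit.requested  -> commit requested
fs.listStatus(hoodieDir).map(_.getPath.getName).sorted.foreach(println)
```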
americas and asia
(2) The americas and asia paths hold the actual data files, stored by partition; the partition path key can be specified.
- Hudi stores the real data files in the Parquet file format;
- each partition directory also contains a metadata file alongside the Parquet columnar data files;
- to support CRUD on the data, Hudi must be able to uniquely identify a record, so it combines the unique field of the dataset (record key) with the partition the record belongs to (partitionPath) as the unique key of that record, as shown in the sketch after this list.
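A minimal sketch of that unique key in practice, continuing the same spark-shell session against the hypothetical `demo_table`: Hudi persists the record key and partition path (together with the file name) as metadata columns next to the user columns, so a plain snapshot read can display them.

```scala
// Minimal sketch: inspect Hudi's metadata columns in the hypothetical demo_table.
val snapshot = spark.read.format("hudi").load("hdfs:///user/hudi/demo_table")

snapshot
  .select("_hoodie_record_key", "_hoodie_partition_path", "_hoodie_file_name", "uuid", "region")
  .show(truncate = false)
```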
Hudi Storage overview
The directory structure of a Hudi dataset is very similar to that of Hive: one dataset corresponds to one root directory. The dataset is split into multiple partitions; each partition field exists as a folder that contains all the files of that partition.
Under the root directory, every partition has a unique partition path, and the data of each partition is stored in multiple files.
Each file has a unique fileId and is marked with the commit that produced it. When an update occurs, multiple files share the same fileId but carry different commits, as in the sketch below.
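The sketch below, still in the same hypothetical `demo_table` session, upserts one of the keys written earlier and then reads the table back: the updated record shows a new `_hoodie_commit_time`, while the fileId encoded in `_hoodie_file_name` stays the same (the `upsert` value is spelled out even though it is Hudi's default write operation).

```scala
// Minimal sketch: update an existing key and observe the new commit on the same fileId.
val updates = Seq(
  ("id-001", "asia", "2022-07-04 10:00:00", 300L)   // same uuid and region as before, newer ts
).toDF("uuid", "region", "event_time", "ts")

updates.write.format("hudi")
  .option("hoodie.table.name", "demo_table")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "region")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.datasource.write.operation", "upsert")
  .mode(SaveMode.Append)
  .save("hdfs:///user/hudi/demo_table")

spark.read.format("hudi").load("hdfs:///user/hudi/demo_table")
  .select("_hoodie_commit_time", "_hoodie_file_name", "uuid", "ts")
  .show(truncate = false)
```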
Metadata
- The time axis (timeline) maintains the metadata of the operations performed on the dataset, supporting instantaneous views of the dataset; this metadata is stored in a metadata directory under the root directory. There are three types of metadata:
- Commits: a single commit captures one atomic write of a batch of data to the dataset. Commits are identified by monotonically increasing timestamps that mark the start of a write operation.
- Cleans: a background activity that removes old file versions in the dataset that queries no longer need.
- Compactions: a background activity that reconciles the differing data structures inside Hudi, for example folding updates from row-based log files into the columnar data. A configuration sketch for these background activities follows this list.
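As a minimal sketch of the knobs behind Cleans and Compactions, the write below (reusing the `df` from the first sketch) targets a hypothetical MERGE_ON_READ table and sets illustrative values for inline compaction and cleaner retention; none of the values are recommendations.

```scala
// Minimal sketch: cleaning/compaction settings on a hypothetical MERGE_ON_READ table.
df.write.format("hudi")
  .option("hoodie.table.name", "demo_mor_table")
  .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "region")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.compact.inline", "true")                  // run compaction as part of the write
  .option("hoodie.compact.inline.max.delta.commits", "5")   // fold row-based logs into columnar files every 5 delta commits
  .option("hoodie.cleaner.commits.retained", "10")          // Cleans keep file versions for the last 10 commits
  .mode(SaveMode.Append)
  .save("hdfs:///user/hudi/demo_mor_table")
```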
Index
- Hudi maintains an index so that, when a record key already exists, the key of a newly written record can be quickly mapped to the corresponding fileId.
- Bloom filter: stored in the footer of each data file. This is the default option and does not depend on any external system; the data and the index always stay consistent.
- Apache HBase: can efficiently look up a small batch of keys. During index tagging, this option may be a few seconds faster. (A configuration sketch follows this list.)
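A minimal sketch of switching between the two index implementations via writer options; `hoodie.index.type` defaults to the Bloom-filter index, and the ZooKeeper/HBase endpoints and table names below are placeholders for a real deployment.

```scala
// Minimal sketch: choose the index implementation through writer options.
val bloomIndex = Map("hoodie.index.type" -> "BLOOM")   // default: bloom filters in data-file footers

val hbaseIndex = Map(
  "hoodie.index.type"           -> "HBASE",
  "hoodie.index.hbase.zkquorum" -> "zk-host1,zk-host2,zk-host3",  // placeholder ZooKeeper quorum
  "hoodie.index.hbase.zkport"   -> "2181",
  "hoodie.index.hbase.table"    -> "hudi_index"                   // placeholder HBase table
)

df.write.format("hudi")
  .options(hbaseIndex)   // or .options(bloomIndex)
  .option("hoodie.table.name", "demo_hbase_table")
  .option("hoodie.datasource.write.recordkey.field", "uuid")
  .option("hoodie.datasource.write.partitionpath.field", "region")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode(SaveMode.Append)
  .save("hdfs:///user/hudi/demo_hbase_table")
```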
Data
Hudi stores all ingested data in two different storage formats; users can choose any data format that satisfies the following conditions:
Read-optimized columnar format (ROFormat): defaults to Apache Parquet;
Write-optimized row-based format (WOFormat): defaults to Apache Avro. A query sketch against both follows.
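As a minimal sketch of how the two formats surface at query time, the reads below target the hypothetical `demo_mor_table` from the earlier sketch: a snapshot query merges the row-based (Avro) log files into the columnar (Parquet) base files, while a read-optimized query scans only the columnar base files.

```scala
// Minimal sketch: snapshot vs. read-optimized queries on the hypothetical MERGE_ON_READ table.
val snapshotView = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "snapshot")        // base files + merged log files
  .load("hdfs:///user/hudi/demo_mor_table")

val readOptimizedView = spark.read.format("hudi")
  .option("hoodie.datasource.query.type", "read_optimized")  // columnar base files only
  .load("hdfs:///user/hudi/demo_mor_table")

snapshotView.show(truncate = false)
readOptimizedView.show(truncate = false)
```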