Hudi data management and storage overview
Data management
**How does Hudi manage data?**
Hudi organizes data in tables. Like a Hive partitioned table, the data in each table is divided into different directories according to partition fields, and every record has a primary key (PrimaryKey) that uniquely identifies it.
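The snippet below is a minimal PySpark sketch of what writing such a table looks like; the path, table name, and column names (`uuid`, `ts`, `region`) are hypothetical, and the Hudi Spark bundle is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

# Assumes a Hudi-enabled Spark session, e.g. started with
# --packages org.apache.hudi:hudi-spark3.2-bundle_2.12:0.11.0
spark = SparkSession.builder.appName("hudi-demo").getOrCreate()

df = spark.createDataFrame(
    [("id-1", "2022-07-03 09:25:00", "americas"),
     ("id-2", "2022-07-03 09:26:00", "asia")],
    ["uuid", "ts", "region"])

hudi_options = {
    "hoodie.table.name": "demo_table",
    # Primary key: uniquely identifies each record.
    "hoodie.datasource.write.recordkey.field": "uuid",
    # Partition field: records land in directories such as americas/ and asia/.
    "hoodie.datasource.write.partitionpath.field": "region",
    # Field used to pick the newest version of a record on upsert.
    "hoodie.datasource.write.precombine.field": "ts",
}

(df.write.format("hudi")
   .options(**hudi_options)
   .mode("overwrite")
   .save("/tmp/hudi/demo_table"))
```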
Hudi Data management
The data files of a Hudi table can be stored on an ordinary operating-system file system or on a distributed file system such as HDFS. For analytical performance and data reliability, HDFS is generally used. Taking HDFS as an example, the files of a Hudi table fall into two categories.
.hoodie
(1) .hoodie files: because CRUD operations are scattered and incremental, each operation generates a file, and as these small files accumulate they seriously degrade HDFS performance. Hudi therefore designed a file-merging mechanism, and the .hoodie folder stores the log files related to these file-merge operations.
Over time, the series of CRUD operations applied to a Hudi table forms a Timeline, and a single operation on the Timeline is called an Instant. Each Instant consists of three parts (a file-name parsing sketch follows the list):
- Instant Action: which kind of operation it was — a data commit (COMMITS), a file compaction (COMPACTION), or a file cleaning (CLEANS);
- Instant Time: the time at which the operation took place;
- State: the state of the operation — requested (REQUESTED), in flight (INFLIGHT), or completed (COMPLETED).
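As a hedged illustration (this is not Hudi's public API, and exact naming varies slightly across Hudi versions), the instants can be seen directly in the `.hoodie` folder, where each instant file name roughly encodes `<instant time>.<action>[.<state>]`, with completed instants omitting the state suffix:

```python
import os

def parse_instant(filename: str):
    # e.g. "20220703092500.commit"            -> completed commit
    #      "20220703092500.commit.requested"  -> requested commit
    parts = filename.split(".")
    instant_time = parts[0]
    action = parts[1]
    state = parts[2].upper() if len(parts) > 2 else "COMPLETED"
    return instant_time, action, state

for name in sorted(os.listdir("/tmp/hudi/demo_table/.hoodie")):
    if name.split(".")[0].isdigit():   # skip hoodie.properties, subfolders, ...
        print(parse_instant(name))
```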
americas and asia
(2) Paths such as americas and asia hold the actual data files, stored by partition; the partition keys that make up the path can be specified.
- Hudi's actual data files are stored in the Parquet file format;
- alongside the Parquet column-oriented data files there is a metadata file.
- To support CRUD, Hudi must be able to uniquely identify a record: it combines the unique field of the dataset (record key) with the partition the record lives in (partitionPath) to form the record's unique key, as the sketch after this list illustrates.
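Conceptually the unique key is just the pair of those two values; the following sketch (plain Python, not Hudi's Java classes) makes that explicit:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class HoodieKey:
    record_key: str       # value of the unique field, e.g. the "uuid" column
    partition_path: str   # partition the record lives in, e.g. "asia"

# The same record key in two partitions yields two distinct data keys.
assert HoodieKey("id-1", "americas") != HoodieKey("id-1", "asia")
```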
Hudi Storage overview
The directory structure of a Hudi dataset is very similar to Hive's: one dataset corresponds to one root directory. The dataset is broken into multiple partitions; partition fields exist as folders, and each folder contains all the files belonging to that partition.
In the root directory, each partition has a unique partition path, and each partition's data is stored in multiple files.
Each file is identified by a unique fileId and by the commit that produced it. When an update occurs, multiple files share the same fileId but carry different commits (see the resolution sketch below).
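A hedged sketch of how a reader can resolve versions, assuming the common base-file naming `<fileId>_<writeToken>_<commitTime>.parquet` (exact naming may vary by version): group files by fileId and keep the one with the latest commit time.

```python
from collections import defaultdict

def latest_files(filenames):
    by_file_id = defaultdict(list)
    for name in filenames:
        file_id, _write_token, rest = name.split("_", 2)
        commit_time = rest.split(".")[0]
        by_file_id[file_id].append((commit_time, name))
    # Commit timestamps are monotonically increasing strings, so max() works.
    return [max(versions)[1] for versions in by_file_id.values()]

files = [
    "fg-1_0-1-0_20220703092500.parquet",  # first write of file group fg-1
    "fg-1_0-2-0_20220703101500.parquet",  # update: same fileId, newer commit
    "fg-2_0-1-0_20220703092500.parquet",
]
print(latest_files(files))
# ['fg-1_0-2-0_20220703101500.parquet', 'fg-2_0-1-0_20220703092500.parquet']
```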
Metadata
- Hudi maintains the metadata of all operations performed on the dataset in the form of a timeline, which enables instantaneous views of the dataset. This metadata is stored in a metadata directory under the root directory. There are three types of metadata:
- Commits: a single commit captures an atomic write of a batch of data into the dataset. Commits are identified by monotonically increasing timestamps, marking the start of a write operation.
- Cleans: background activity that removes old versions of files in the dataset that are no longer needed by queries.
- Compactions: background activity that reconciles differences among Hudi's internal data structures, for example moving update operations from row-based log files into columnar data files.
Index
- Hudi maintains an index so that, when a record's key already exists, the key of an incoming record can be quickly mapped to its fileId. Two implementations are available (a config sketch follows the list):
- Bloom filter: stored in the footer of each data file. This is the default option and depends on no external system; the index and the data always stay consistent.
- Apache HBase: can efficiently look up a small batch of keys, which may save a few seconds during the index-tagging step.
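Selecting the index is a write-side configuration; below is a minimal sketch using Hudi's `hoodie.index.type` option (the HBase connection values are placeholders, and either dict would be merged into the write options of a job like the first sketch):

```python
# Default: Bloom-filter index, no external system required.
bloom_index_options = {"hoodie.index.type": "BLOOM"}

# Alternative: HBase-backed index (connection details are hypothetical).
hbase_index_options = {
    "hoodie.index.type": "HBASE",
    "hoodie.index.hbase.zkquorum": "zk-host",
    "hoodie.index.hbase.zkport": "2181",
    "hoodie.index.hbase.table": "hudi_index",
}
```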
Data
Hudi stores all ingested data in two different storage formats; users can choose any data format that meets the following requirements (a read sketch follows the list):
- Read-optimized columnar format (ROFormat), defaulting to Apache Parquet;
- Write-optimized row format (WOFormat), defaulting to Apache Avro.
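On the read side, the two representations surface as different query types; a minimal sketch, reusing the Spark session and table path from the write sketch above:

```python
# Read-optimized view: serves only the columnar (Parquet) base files.
ro_df = (spark.read.format("hudi")
         .option("hoodie.datasource.query.type", "read_optimized")
         .load("/tmp/hudi/demo_table"))

# Snapshot view: also reflects updates still sitting in row-based (Avro) logs.
snapshot_df = (spark.read.format("hudi")
               .option("hoodie.datasource.query.type", "snapshot")
               .load("/tmp/hudi/demo_table"))

snapshot_df.show()
```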