Halodoc data engineering has evolved from the traditional data platform 1.0 to a modern data platform 2.0 rebuilt on a LakeHouse architecture. In our previous blog, we described how we implemented the Lakehouse architecture at Halodoc to serve large-scale analytical workloads, along with the design considerations, best practices, and lessons learned while building platform 2.0.
In this blog we will introduce Apache Hudi and how it helped us build a transactional data lake. We will also cover some of the challenges we faced while building the Lakehouse and how we overcame them with Apache Hudi.
Apache Hudi
Let's start with a basic understanding of Apache Hudi. Hudi is a rich platform for building streaming data lakes with incremental data pipelines on a self-managed database layer, optimized for lake engines as well as conventional batch processing.
Apache Hudi brings core warehouse and database functionality directly to the data lake. Hudi provides tables, transactions, efficient upserts/deletes, advanced indexes, streaming ingestion services, data clustering/compaction optimization, and concurrency control, while keeping the data in open source file formats.
Apache Hudi can easily be used on any cloud storage platform. Its advanced performance optimizations make analytical workloads faster with any popular query engine, including Apache Spark, Flink, Presto, Trino, Hive, and others.
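To make this concrete, here is a minimal PySpark sketch of querying a Hudi table; the S3 path and table name are placeholders, and it assumes the hudi-spark bundle is available to the Spark session:

# Minimal sketch: snapshot query of a Hudi table with Spark (path is a placeholder).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-read-sketch")
         # Hudi's quickstart recommends the Kryo serializer for Spark.
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

df = spark.read.format("hudi").load("s3://your-lake-bucket/hudi/transactions/")
df.createOrReplaceTempView("transactions")
spark.sql("SELECT COUNT(*) FROM transactions").show()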
Let's look at some of the key challenges we encountered while building the Lakehouse, and how we addressed them with Hudi and AWS cloud services.
Incremental upserts in the LakeHouse
One of the main challenges everyone faces when building a transactional data lake is identifying the correct primary key for updating records in the data lake. In most cases the primary key serves as the unique identifier, and a timestamp field is used to filter out duplicate records in the incoming batch.
At Halodoc, most microservices use RDS MySQL as their data store. We had more than 50 MySQL databases to migrate to the data lake; transactions pass through various states, and in most cases updates occur frequently.
Problem:
MySQL RDS stores timestamp fields at second precision, which makes it difficult to track transactions that occur within milliseconds or even microseconds. Identifying the latest transaction in an incoming batch using the business-modified timestamp field was a challenge for us.
We tried several approaches to solve this, such as using a rank function or combining multiple fields to pick the right composite key. The choice of composite key was not uniform across tables, and different logic could be needed to identify the latest transaction.
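For illustration only, here is a minimal sketch of the window-function de-duplication we experimented with; the column names (id, updated_at) and the staging path are hypothetical:

# Hypothetical de-duplication of an incoming batch using a window function.
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, row_number

spark = SparkSession.builder.appName("dedup-sketch").getOrCreate()

# Hypothetical staging location for the incoming batch.
incoming_df = spark.read.parquet("s3://your-lake-bucket/staging/transactions/")

# Keep only the most recent row per primary key, ordered by a modification timestamp.
w = Window.partitionBy("id").orderBy(col("updated_at").desc())
latest_df = (incoming_df
             .withColumn("rn", row_number().over(w))
             .filter(col("rn") == 1)
             .drop("rn"))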
Solution:
AWS Data Migration Service can be configured with transformation rules that add extra headers carrying custom or predefined attributes to each record.
ar_h_change_seq: a unique, incrementing number from the source database, composed of a timestamp and an auto-incrementing sequence. Its exact value depends on the source database system.
This header makes it easy to filter out duplicate records so that only the latest record is upserted into the data lake. The header is only populated for ongoing (CDC) changes; for full loads we assign 0 to the records by default, while incremental records each carry a unique value. We use ar_h_change_seq as the precombine field to de-duplicate records in the incoming batch.
Hudi configuration:
precombine = ar_h_change_seq
hoodie.datasource.write.precombine.field: precombine
hoodie.datasource.write.payload.class: 'org.apache.hudi.common.model.DefaultHoodieRecordPayload'
hoodie.payload.ordering.field: precombine
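Putting it together, here is a hedged PySpark sketch of how these options could be passed to a Hudi upsert; the table name, record key column, and S3 paths are placeholders rather than our exact production job:

# Hypothetical upsert into a Hudi table, using ar_h_change_seq as the precombine field.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-upsert-sketch")
         .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
         .getOrCreate())

# Hypothetical incoming CDC batch landed by DMS.
incoming_batch_df = spark.read.parquet("s3://your-lake-bucket/staging/transactions/")

precombine = "ar_h_change_seq"
hudi_options = {
    "hoodie.table.name": "transactions",                      # placeholder table name
    "hoodie.datasource.write.recordkey.field": "id",          # placeholder primary key
    "hoodie.datasource.write.precombine.field": precombine,
    "hoodie.datasource.write.payload.class": "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
    "hoodie.payload.ordering.field": precombine,
    "hoodie.datasource.write.operation": "upsert",
}

(incoming_batch_df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("s3://your-lake-bucket/hudi/transactions/"))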

Small file problems in the data lake
When building a data lake, updates and inserts happen frequently, which results in many small files in every partition.
Problem:
Let's look at how small files cause problems at query time. When a query is triggered to extract or transform a dataset, the driver node has to collect metadata for every file, which adds performance overhead to the transformation process.
Solution:
Compacting small files regularly helps maintain the right file sizes and improves query performance. Apache Hudi supports both synchronous and asynchronous compaction.
- Synchronous compaction: enabled during the write process itself, which increases ETL execution time when updating records in Hudi.
- Asynchronous compaction: runs as a separate process and needs separate memory, but it does not affect the write process and is a more scalable option.
At Halodoc we started with synchronous compaction. Over time we plan to adopt a hybrid approach based on table size, growth, and use case.
Hudi configuration:
hoodie.datasource.clustering.inline.enable
hoodie.datasource.compaction.async.enable
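As an illustration, here is a sketch of how these flags would sit alongside the other writer options; the values are illustrative, not our production tuning:

# Illustrative flags only; which one applies depends on the table type and workload.
compaction_options = {
    # Inline (synchronous) clustering: small files are rewritten as part of the write itself.
    "hoodie.datasource.clustering.inline.enable": "true",
    # Asynchronous compaction (relevant for MERGE_ON_READ tables): runs outside the write path.
    "hoodie.datasource.compaction.async.enable": "true",
}

In practice the relevant flag is merged into the writer options shown in the earlier upsert sketch.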
Maintain storage size to reduce costs
Data lake storage is cheap, but that does not mean we should store data that is not needed for business analysis; otherwise we will quickly see storage costs climbing. Apache Hudi maintains file versions on every upsert operation in order to provide time-travel queries over the records. Each commit creates a new version of the files, producing a large number of versioned files.
Problem:
If we do not enable a cleaner policy, the storage size will grow exponentially and directly impact storage costs. Older commits must be cleaned up if they carry no business value.
Solution:
Hudi has two cleaner policies: based on file versions and based on commit count (the number of commits to retain). At Halodoc we calculated how frequently writes occur and how long the ETL process takes to complete, and based on that we derived the number of commits to retain in the Hudi datasets.
Example: if data is ingested into Hudi every 5 minutes and the longest-running query can take 1 hour to complete, then the platform should retain at least 60/5 = 12 commits.
Hudi configuration:
hoodie.cleaner.policy: KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained: 12
or
hoodie.cleaner.policy: KEEP_LATEST_FILE_VERSIONS
hoodie.cleaner.fileversions.retained: 1
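For illustration, a small Python sketch of deriving the retention value from the ingestion interval and the longest query runtime, following the 60/5 = 12 example above; the helper function is hypothetical:

import math

# Hypothetical helper: retain enough commits to cover the longest-running query.
def commits_to_retain(ingest_interval_minutes: int, longest_query_minutes: int) -> int:
    return math.ceil(longest_query_minutes / ingest_interval_minutes)

cleaner_options = {
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": str(commits_to_retain(5, 60)),  # -> "12"
}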
Choose the right storage type based on latency and business use cases
Apache Hudi has two storage types for storing datasets for different use cases. Once a storage type is chosen, changing to the other type can be a tedious process (changing CoW to MoR is relatively easy; MoR to CoW is more troublesome). So it is very important to select the right storage type before migrating data into Hudi datasets.
Problem:
Choosing the wrong storage type can affect ETL execution time and the data latency expected by data consumers.
Solution:
At Halodoc we use both storage types for our workloads.
MoR: MoR stands for merge on read. We chose MoR for tables that need read access immediately after writing. It also reduces upsert time, because Hudi maintains AVRO log files for incremental changes and does not have to rewrite existing parquet files.
MoR provides two views of the dataset, _ro and _rt:
- _ro for the read-optimized table.
- _rt for the real-time table.
CoW: CoW stands for copy on write. CoW was chosen for datasets where data latency, update cost, and write amplification are a lower priority, but read performance is a higher priority.
type = COPY_ON_WRITE / MERGE_ON_READ
hoodie.datasource.write.table.type: type
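As a rough sketch, the table type is declared on the write side, and for MoR tables the view is chosen on the read side; the path below is a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write side: declare the table type (COPY_ON_WRITE or MERGE_ON_READ).
table_type_option = {"hoodie.datasource.write.table.type": "MERGE_ON_READ"}

# Read side for a MoR table: "read_optimized" corresponds to the _ro view,
# "snapshot" corresponds to the _rt (real-time) view.
ro_df = (spark.read.format("hudi")
         .option("hoodie.datasource.query.type", "read_optimized")
         .load("s3://your-lake-bucket/hudi/transactions/"))
rt_df = (spark.read.format("hudi")
         .option("hoodie.datasource.query.type", "snapshot")
         .load("s3://your-lake-bucket/hudi/transactions/"))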
File listing is expensive, and how Hudi solves it
Generally speaking, upserts and updates are expensive on distributed object stores or file systems because these systems are inherently immutable: an update involves tracking and identifying the subset of files that need to change and rewriting them as new versions containing the latest records. Apache Hudi stores metadata about every file slice and file group in order to track and update records during upsert operations.
Problem:
As mentioned earlier, collecting file information across a large number of files in different partitions is costly for the driver node and can lead to memory/compute problems.
Solution:
To solve this, Hudi introduced the metadata table concept: all file information is stored in a separate table that is synced whenever the source table changes. This helps Spark list files from a single location, leading to optimal resource utilization.
This can easily be enabled with the following configuration.
Hudi configuration:
hoodie.metadata.enabled: true
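For illustration, a sketch of the flag on both the write and read paths; how the reader uses the metadata table depends on the Hudi version in use, and the path is a placeholder:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Write side: builds and maintains Hudi's internal metadata table.
metadata_option = {"hoodie.metadata.enabled": "true"}

# Read side: list files from the metadata table instead of the raw storage listing
# (exact reader behavior depends on the Hudi version).
df = (spark.read.format("hudi")
      .option("hoodie.metadata.enabled", "true")
      .load("s3://your-lake-bucket/hudi/transactions/"))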
Choosing the right index for the Hudi dataset
Traditional databases use indexes to retrieve data from tables efficiently. Apache Hudi also has a concept of indexing, but it works slightly differently: indexes in Hudi are mainly used to enforce uniqueness of keys across all partitions of a table.
Problem:
When building a transactional data lake, maintaining/limiting duplicate records per partition, or globally, is always important.
Solution:
Hudi solves this with indexes on the dataset, offering both global and non-global indexes. The default is the Bloom index. Currently Hudi supports:
- Bloom index: uses Bloom filters built from the record keys, optionally also pruning candidate files using record key ranges.
- Simple index: joins incoming update/delete records against keys extracted from the table on storage.
- HBase index: manages the index mapping in an external Apache HBase table.
At Halodoc we leverage the global Bloom index so that records are unique across partitions. The choice of index should be driven by the source behavior, or by whether duplicates need to be maintained.
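As an illustration, a sketch of the index settings as they might be merged into the write options; the second option is an assumption about how partition changes are handled with a global index, and its default varies by Hudi version:

index_options = {
    # Global Bloom index: enforces record-key uniqueness across all partitions.
    "hoodie.index.type": "GLOBAL_BLOOM",
    # Assumption: with a global index this controls whether a record whose partition
    # value changed is moved to the new partition on update.
    "hoodie.bloom.index.update.partition.path": "true",
}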
Summary
At Halodoc we have been using Apache Hudi for the past 6 months, and it has served our large-scale data workloads well. There was a bit of a learning curve at the beginning in choosing the right configurations for Apache Hudi.
In this blog we shared some of the problems we encountered while building our LakeHouse, and the best practices for correctly tuning/configuring parameters when using Apache Hudi in a production environment.









