
Halodoc's key experience in building Lakehouse using Apache Hudi


Halodoc data engineering has evolved from the traditional data platform 1.0 to a modern data platform 2.0 built on the LakeHouse architecture. In our previous blog post, we described how we implemented the Lakehouse architecture at Halodoc to serve large-scale analytics workloads, and covered the design considerations, best practices, and lessons learned while building platform 2.0.
In this blog we introduce Apache Hudi and how it helped us build a transactional data lake. We also highlight some of the challenges we faced while building the Lakehouse and how we overcame them with Apache Hudi.

Apache Hudi

Let's start with a basic understanding of Apache Hudi. Hudi is a rich platform for building streaming data lakes with incremental data pipelines on a self-managing database layer, optimized for lake engines and regular batch processing.
Apache Hudi brings core warehouse and database functionality directly to the data lake. Hudi provides tables, transactions, efficient upserts/deletes, advanced indexes, streaming ingestion services, data clustering/compaction optimizations, and concurrency control, all while keeping data in open-source file formats.
Apache Hudi can easily be used on any cloud storage platform. Hudi's advanced performance optimizations make analytical workloads faster on any of the popular query engines, including Apache Spark, Flink, Presto, Trino, and Hive.
Let's look at some of the key challenges we encountered while building the Lakehouse and how we addressed them with Hudi and AWS cloud services.

Performing incremental upserts in the LakeHouse

One of the main challenges everyone faces when building a transactional data lake is identifying the correct primary key for updating records in the data lake. In most cases the primary key is used as the unique identifier, and a timestamp field is used to filter out duplicate records in the incoming batch.
At Halodoc, most microservices use RDS MySQL as their data store. We have 50+ MySQL databases that need to be migrated to the data lake; transactions pass through various states, and in most cases updates occur frequently.

Problem:
MySQL RDS stores timestamp fields with second precision, which makes it hard to track transactions that occur within milliseconds or even microseconds. Identifying the latest transaction in an incoming batch using the business-modified timestamp field was a challenge for us.
We tried several ways to solve this, such as using a rank function or combining multiple fields and selecting the right composite key. The choice of composite key was not uniform across tables, and different logic might be needed to identify the latest transaction.

Solution:
AWS Data Migration Service can be configured with transformation rules that add extra headers carrying custom or predefined attributes.

ar_h_change_seq: a unique, incrementing number from the source database that consists of a timestamp and an auto-incrementing number. The value depends on the source database system.

This header lets us easily filter out duplicate records and upsert only the latest records into the data lake. The header is only applied to ongoing changes: for full loads we assign 0 to the records by default, while for incremental records each record carries a unique identifier. We use ar_h_change_seq as the precombine field to deduplicate records in the incoming batch.

Hudi configuration:

precombine = ar_h_change_seq
hoodie.datasource.write.precombine.field: precombine
hoodie.datasource.write.payload.class: 'org.apache.hudi.common.model.DefaultHoodieRecordPayload'
hoodie.payload.ordering.field: precombine
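
To make the configuration above concrete, here is a minimal PySpark sketch of how these options can be passed when upserting an incoming batch into a Hudi dataset. The table name, record key, partition field, and S3 paths are hypothetical placeholders; the option keys are the ones listed above plus the standard Hudi datasource write options.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental-upsert").getOrCreate()

# incoming batch landed by AWS DMS, carrying the ar_h_change_seq header (path is hypothetical)
incoming_batch_df = spark.read.parquet("s3://dms-landing-bucket/transactions/")

hudi_options = {
    "hoodie.table.name": "transactions",                            # hypothetical table name
    "hoodie.datasource.write.recordkey.field": "id",                # primary key column (assumption)
    "hoodie.datasource.write.partitionpath.field": "created_date",  # partition column (assumption)
    "hoodie.datasource.write.operation": "upsert",
    # configuration from this section: use the DMS header as the precombine/ordering field
    "hoodie.datasource.write.precombine.field": "ar_h_change_seq",
    "hoodie.datasource.write.payload.class": "org.apache.hudi.common.model.DefaultHoodieRecordPayload",
    "hoodie.payload.ordering.field": "ar_h_change_seq",
}

incoming_batch_df.write.format("hudi") \
    .options(**hudi_options) \
    .mode("append") \
    .save("s3://lakehouse-bucket/transactions/")                    # hypothetical base path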

Small file problems in the data lake

When building a data lake, frequent updates/inserts result in many small files in each partition.

Problem:
Let's look at how small files cause problems at query time. When a query is triggered to extract or transform the dataset, the driver node has to collect metadata for every file, which adds performance overhead to the transformation process.

Solution:
Compacting small files regularly helps maintain the right file size, which improves query performance. Apache Hudi supports both synchronous and asynchronous compaction.

  • Synchronous compression : This can be enabled during the write process itself , This will increase ETL Execute time to update Hudi Records in .
  • Asynchronous compression : Compression can be achieved through different processes , And you need separate memory to implement . This does not affect the write process , It is also a scalable solution .

At Halodoc, we started with synchronous compaction. Over time, we plan to adopt a hybrid compaction approach based on table size, growth, and use case.

Hudi configuration:

hoodie.datasource.clustering.inline.enable
hoodie.datasource.compaction.async.enable
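
A minimal sketch of how the asynchronous compaction flag above can be attached to a MERGE_ON_READ write, assuming the same hudi_options dictionary, DataFrame, and base path as in the earlier upsert sketch; the synchronous (inline) alternative is shown as a comment.

compaction_options = {
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    # asynchronous compaction: scheduled on write, executed by a separate process
    "hoodie.datasource.compaction.async.enable": "true",
    # synchronous (inline) alternative, which adds to ETL execution time:
    # "hoodie.compact.inline": "true",
}

incoming_batch_df.write.format("hudi") \
    .options(**hudi_options) \
    .options(**compaction_options) \
    .mode("append") \
    .save("s3://lakehouse-bucket/transactions/")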

Maintain storage size to reduce costs

Data lake storage is cheap, but that does not mean we should store data that is not needed for business analysis; otherwise we will quickly see storage costs rise. Apache Hudi maintains file versions on every upsert operation in order to provide time-travel queries over the records. Every commit creates a new version of a file, which results in a large number of versioned files.

Problem:
If we do not enable a cleaning policy, storage size grows exponentially and directly impacts storage costs. Older commits must be purged if they have no business value.

Solution:
Hudi has two cleaning policies: based on file versions and based on count (the number of commits to retain). At Halodoc, we measured how frequently writes occur and how long the ETL process takes to complete, and based on that we derived the number of commits to retain in the Hudi datasets.
Example: if data is ingested into the Hudi job every 5 minutes and the longest-running query can take up to 1 hour to complete, then the platform should retain at least 60/5 = 12 commits.

Hudi configuration:

hoodie.cleaner.policy: KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained: 12

or

hoodie.cleaner.policy: KEEP_LATEST_FILE_VERSIONS
hoodie.cleaner.fileversions.retained: 1
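
The retention count above can be derived directly from the ingestion interval and the longest expected query, as in this small self-contained sketch (the 5-minute interval and 1-hour query time are the example figures from this section):

import math

ingestion_interval_minutes = 5      # a batch lands in the Hudi job every 5 minutes
longest_query_minutes = 60          # the longest-running query may need up to 1 hour

# keep at least enough commits to cover the longest query: 60 / 5 = 12
commits_retained = math.ceil(longest_query_minutes / ingestion_interval_minutes)

cleaner_options = {
    "hoodie.cleaner.policy": "KEEP_LATEST_COMMITS",
    "hoodie.cleaner.commits.retained": str(commits_retained),
}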

Choose the right storage type based on latency and business use cases

Apache Hudi offers two storage types for datasets with different use cases. Once a storage type is chosen, changing it to the other type can be a tedious process (changing CoW to MoR is relatively easy, while changing MoR to CoW is more troublesome). So it is very important to select the correct storage type before migrating data into a Hudi dataset.

Problem:
Choosing the wrong storage type can affect ETL execution times and the data latency expected by data consumers.

Solution:
At Halodoc we use both storage types for our workloads.
MoR: MoR stands for Merge on Read. We chose MoR for tables that need immediate read access after writes. It also reduces upsert time, because Hudi maintains AVRO log files for incremental changes and does not have to rewrite the existing parquet files.
MoR provides 2 views of the dataset, _ro and _rt.

  • _ro for the read-optimized table.
  • _rt for the real-time table.

CoW: CoW stands for Copy on Write. The CoW storage type was selected for datasets where data latency, update cost, and write amplification are a lower priority, but read performance is a higher priority.

type = COPY_ON_WRITE / MERGE_ON_READ
hoodie.datasource.write.table.type: type
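
As a rough illustration, the sketch below sets the table type at write time and reads the two MoR views with Spark, reusing the spark session from the first sketch; the base path is hypothetical, and the snapshot vs. read-optimized query types correspond to the _rt and _ro tables described above.

table_type_option = {"hoodie.datasource.write.table.type": "MERGE_ON_READ"}   # or "COPY_ON_WRITE"

base_path = "s3://lakehouse-bucket/transactions/"    # hypothetical base path

# _rt equivalent: snapshot view, merges base parquet files with the AVRO logs at read time
rt_df = spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "snapshot") \
    .load(base_path)

# _ro equivalent: read-optimized view, reads only compacted base files (faster, possibly stale)
ro_df = spark.read.format("hudi") \
    .option("hoodie.datasource.query.type", "read_optimized") \
    .load(base_path)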

Heavy file listing and how Hudi solves it

In general, upserts and updates on distributed object stores or file systems are expensive, because these systems are inherently immutable: an update involves tracking and identifying the subset of files that need to be updated and overwriting them with new versions containing the latest records. Apache Hudi stores metadata for every file slice and file group in order to track and update records for upsert operations.

Problem:
As mentioned earlier, collecting file information across a large number of files in different partitions is costly for the driver node and can lead to memory/compute problems.

Solution:
To solve this problem, Hudi introduced the metadata table concept: all file information is stored in a separate table that is kept in sync as the source changes. This helps Spark read the file listing from a single location, leading to optimal resource utilization.
This can easily be achieved with the following configuration.
Hudi configuration:

hoodie.metadata.enabled: true
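
A minimal sketch of passing the flag quoted above on both the write side (so the metadata table is built and kept in sync) and the read side (so Spark lists files from the metadata table instead of the object store), reusing the options, DataFrame, and base path from the earlier sketches.

metadata_option = {"hoodie.metadata.enabled": "true"}   # flag from the configuration above

# write side: maintain the metadata table alongside the data
incoming_batch_df.write.format("hudi") \
    .options(**hudi_options) \
    .options(**metadata_option) \
    .mode("append") \
    .save(base_path)

# read side: file listing is served from the metadata table instead of listing the object store
df = spark.read.format("hudi") \
    .options(**metadata_option) \
    .load(base_path)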

Selecting the right index for the Hudi dataset

In traditional databases, indexes are used to retrieve data from tables efficiently. Apache Hudi also has the concept of an index, but it works in a slightly different way: indexes in Hudi are primarily used to enforce the uniqueness of keys across all partitions of a table.

Problem:
When building a transactional data lake, it is always important to maintain/limit duplicate records, whether per partition or globally across partitions.

Solution:
Hudi solves this problem with indexes on the dataset, offering both global and non-global indexes. The default is the Bloom Index. Hudi currently supports:

  • Bloom Index: uses Bloom filters built from the record keys, optionally also pruning candidate files using record key ranges.
  • Simple Index: performs a join between incoming update/delete records and the records stored in the table.
  • HBase Index: manages the index mapping in an external Apache HBase table.

At Halodoc, we leverage the global Bloom index so that records are unique across partitions. The choice of index must be made based on the source behavior, or on whether duplicates need to be retained.
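
A sketch of how the index type can be switched to a global Bloom index on the writer, assuming the same write options, DataFrame, and base path as in the earlier sketches; the index type values are the standard Hudi names.

index_options = {
    # GLOBAL_BLOOM enforces key uniqueness across all partitions; BLOOM (the default) is per partition
    "hoodie.index.type": "GLOBAL_BLOOM",
}

incoming_batch_df.write.format("hudi") \
    .options(**hudi_options) \
    .options(**index_options) \
    .mode("append") \
    .save(base_path)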

Summary

At Halodoc, we have been using Apache Hudi for the past 6 months, and it has served our large-scale data workloads well. There was some learning curve involved at the beginning in choosing the right configuration for Apache Hudi.
In this blog, we shared some of the problems we encountered while building the LakeHouse, and our best practices for correctly setting the configuration parameters when using Apache Hudi in a production environment.
