Hudi vs Delta vs Iceberg
2022-07-06 19:35:00 | April day 03
What is TPC-DS?
TPC-DS is a data warehouse benchmark defined by the Transaction Processing Performance Council (TPC). TPC is a non-profit organization, established by the database community in the late 1980s, whose goal is to develop benchmarks that can objectively test the performance of database systems by simulating real-world scenarios. TPC has had a significant impact on the database industry.
The "DS" in TPC-DS stands for "Decision Support". TPC-DS contains 99 queries, ranging from simple aggregations to advanced pattern analysis.
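As a rough illustration (not one of the official 99 queries), a TPC-DS-style aggregation can be sketched in Python with sqlite3; the `store_sales` table here is a simplified stand-in for the much wider TPC-DS fact table of the same name:

```python
import sqlite3

# In-memory toy database; store_sales is a simplified stand-in for
# the TPC-DS fact table of the same name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE store_sales (ss_store_sk INTEGER, ss_net_paid REAL)")
conn.executemany(
    "INSERT INTO store_sales VALUES (?, ?)",
    [(1, 10.0), (1, 20.0), (2, 5.0), (2, 15.0), (2, 25.0)],
)

# A simple aggregation in the spirit of the benchmark's easier queries:
# total revenue per store, highest first.
rows = conn.execute(
    "SELECT ss_store_sk, SUM(ss_net_paid) AS revenue "
    "FROM store_sales GROUP BY ss_store_sk ORDER BY revenue DESC"
).fetchall()
print(rows)  # [(2, 45.0), (1, 30.0)]
```

The real benchmark queries run against much larger scale factors and add joins, window functions, and subqueries on top of aggregations like this one.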
1. We have publicly shared our modifications [8] to Delta's benchmark framework to support creating Hudi tables via the Spark Datasource or Spark SQL; this can be switched dynamically in the benchmark definition.
2. TPC-DS loading does not involve updates. The databeans configuration loaded Hudi with the inappropriate
upsert
write operation, even though the Hudi documentation clearly states [9] that bulk-insert
[10] is the recommended write operation for this use case. In addition, we adjusted Hudi's parquet file size setting to match Delta Lake's default.
CREATE TABLE ...
USING HUDI
OPTIONS (
type = 'cow',
primaryKey = '...',
precombineField = '',
'hoodie.datasource.write.hive_style_partitioning' = 'true',
-- Disable Hudi’s record-level metadata for updates, incremental processing, etc
'hoodie.populate.meta.fields' = 'false',
-- Use “bulk-insert” write-operation instead of default “upsert”
'hoodie.sql.insert.mode' = 'non-strict',
'hoodie.sql.bulk.insert.enable' = 'true',
-- Perform bulk-insert w/o sorting or automatic file-sizing
'hoodie.bulkinsert.sort.mode' = 'NONE',
-- Increasing the file-size to match Delta’s setting
'hoodie.parquet.max.file.size' = '141557760',
'hoodie.parquet.block.size' = '141557760',
'hoodie.parquet.compression.codec' = 'snappy',
-- All TPC-DS tables are relatively small and don't require the metadata table (S3 file-listing is sufficient)
'hoodie.metadata.enable' = 'false',
'hoodie.parquet.writelegacyformat.enabled' = 'false'
)
LOCATION '...'
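The two file-size settings above are plain byte counts; a quick arithmetic check (assuming, as the property names suggest, that the values are interpreted in bytes) shows they correspond to exactly 135 MiB:

```python
# hoodie.parquet.max.file.size / hoodie.parquet.block.size above are
# given in bytes; confirm the value is exactly 135 MiB.
target_bytes = 141557760
mib = target_bytes / (1024 * 1024)
print(mib)  # 135.0
```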
Hudi's origins [11] are rooted in incremental data processing, with the aim of turning all old batch jobs incremental [12]. Hudi's default configuration is therefore geared toward incremental upserts and toward generating change streams for incremental ETL pipelines, treating the initial load as a rare one-time operation. Extra care is therefore needed with the loading configuration for the load times to be comparable with Delta's.
4. Running the benchmark
4.1 Load
As can be clearly seen, Delta and Hudi 0.11.1 are within 6% of each other, and the current Hudi master is within 5% (we also benchmarked Hudi's master branch, because we had found a bug in the Parquet encoding configuration [13] and fixed it promptly). Despite this, Hudi supports a rich feature set on top of raw Parquet tables, for example:
• Incremental processing [14] (fetching records changed since a commit at timestamp t)
• Record-level indexes [15] (supporting record-level lookups, updates, and deletes)
To enable these and more, Hudi internally stores a set of additional metadata with each record, called meta fields [16]. Since TPC-DS focuses exclusively on snapshot queries, in this particular experiment these fields were disabled (and not computed); Hudi instead leaves them null so that they can be enabled in the future without schema evolution. Adding five such null fields is a very low-cost operation, but its cost is still not negligible.
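As a minimal sketch of what "leaving the meta fields null" means at the record level — the five field names below are Hudi's standard meta columns, listed here from general familiarity with Hudi rather than from the benchmark code:

```python
# Hudi's five record-level meta fields; with
# hoodie.populate.meta.fields = 'false' they are written as nulls.
HUDI_META_FIELDS = [
    "_hoodie_commit_time",
    "_hoodie_commit_seqno",
    "_hoodie_record_key",
    "_hoodie_partition_path",
    "_hoodie_file_name",
]

def with_null_meta_fields(record: dict) -> dict:
    """Prepend null meta columns to a data record, mirroring the
    disabled-meta-fields layout described above."""
    return {**{f: None for f in HUDI_META_FIELDS}, **record}

row = with_null_meta_fields({"ss_store_sk": 1, "ss_net_paid": 10.0})
print(sum(v is None for v in row.values()))  # 5 null meta fields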
4.2 Queries
As we can see, Hudi 0.11.1 and Delta 1.2.0 perform nearly identically, and Hudi's current master is slightly faster (~5%). You can find the raw logs in this directory on Google Drive:
• Hudi 0.11: load [17] / query [18]
• Hudi master: load [19] / query [20]
• Delta 1.2.0: load [21] / query [22]
• Delta 2.0.0 rc1: load [23] / query [24]
To reproduce the above results, use our branch of the Delta benchmark repository [25] and follow the steps in its README.
5. Conclusion
To summarize, we want to emphasize the importance of openness and reproducibility in an area as sensitive and complex as performance benchmarking. As we have seen time and again, obtaining reliable benchmark results is tedious and challenging, requiring dedication, diligence, and rigor. Looking ahead, we plan to publish more internal benchmarks highlighting how Hudi's rich feature set achieves unmatched performance levels on other common industry workloads. Stay tuned!
Environment setup
In this benchmark, we used Delta 1.0 and Iceberg 0.13.0; the environment configuration is listed in the table below.
As mentioned earlier, we used the open-source TPC-DS benchmark from Delta OSS [5] and extended it to support Iceberg. We recorded load performance, i.e., the time required to load the data from Parquet format into Delta/Iceberg tables. We also recorded query performance: each TPC-DS query was run three times, and the average running time was used as the result.
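The measurement procedure described above (run each query three times, report the average) can be sketched as a small timing harness; `run_query` below is a hypothetical stand-in for submitting one TPC-DS query to the engine:

```python
import time

def average_runtime(run_query, runs=3):
    """Run a query `runs` times and return the mean wall-clock time,
    mirroring the three-run averaging used in the benchmark."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

# Toy stand-in for an actual query submission.
avg = average_runtime(lambda: sum(range(100_000)))
print(f"average runtime: {avg:.6f}s")
```

A real harness would also discard a warm-up run so that JIT compilation and caching do not skew the first measurement; the post does not say whether one was used here.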
Test results
1. Overall performance
After completing the benchmark, we found that Delta performed better overall on both load and query, being 3.5 times faster than Iceberg. Loading the data into Delta and executing the TPC-DS queries took 1.68 hours, while Iceberg required 5.99 hours.
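The headline 3.5x figure follows directly from the two end-to-end totals just reported (a quick check of the arithmetic):

```python
# End-to-end wall-clock hours reported for load + all TPC-DS queries.
delta_hours = 1.68
iceberg_hours = 5.99

speedup = iceberg_hours / delta_hours
print(f"Iceberg/Delta = {speedup:.2f}x")  # ~3.57x, reported as 3.5x
```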
2. Load performance
When loading data from Parquet files into the two formats, Delta was 1.3 times faster than Iceberg overall.
To analyze the load results further, we dug into the detailed loading results for each table and noticed that the gap in loading time grows as the table gets larger. For example, when loading the customer table, Delta and Iceberg performed virtually identically. On the other hand, when loading the store_sales table, one of the largest tables in the TPC-DS benchmark, Delta was 1.5 times faster than Iceberg.
This indicates that Delta is faster and scales better than Iceberg when loading data.
3. Query performance
When executing the TPC-DS queries, Delta was 4.5 times faster than Iceberg overall. Executing all queries on Delta took 1.14 hours, while executing the same queries on Iceberg took 5.27 hours.
Iceberg and Delta showed essentially the same performance on query34, query41, query46, and query68; the difference on these queries was less than 1 second.
However, on the other TPC-DS queries Delta was always faster than Iceberg, by varying margins.
On some queries, such as query72, Delta was 66 times faster than Iceberg.
On the remaining queries, the gap between Delta and Iceberg ranged from 1.1x to 24x, always in Delta's favor.
Summary
After running the benchmark, Delta exceeded Iceberg in both scalability and performance, sometimes by an unexpectedly large margin. This benchmark gives us and our customers a clear answer as to which solution to choose when building a data lakehouse.
It should also be noted that both Iceberg and Delta are constantly improving; as they do, we will continue to track their performance and share our results with the wider community.
If you want to analyze the benchmark results further and form your own opinion, you can download the full benchmark report here [6].
Original article:
https://databeans-blogs.medium.com/delta-vs-iceberg-performance-as-a-decisive-criteria-add7bcdde03d