Hudi vs Delta vs Iceberg
What is TPC-DS?
TPC-DS is a data warehouse benchmark defined by the Transaction Processing Performance Council (TPC). TPC is a non-profit organization founded by the database community in the late 1980s, with the goal of developing benchmarks that can objectively test database system performance by simulating real-world scenarios. TPC has had a significant impact on the database industry.
"Decision Support" is what the "DS" in TPC-DS stands for. TPC-DS contains 99 queries, ranging from simple aggregations to advanced pattern analysis.
1. We have publicly shared our modifications to Delta's benchmark framework [8] to support creating Hudi tables through the Spark Datasource or Spark SQL. This can be switched dynamically in the benchmark definition.
2. TPC-DS loading does not involve updates. The databeans configuration loaded Hudi with an inappropriate write operation, upsert, even though Hudi clearly documents [9] that bulk-insert [10] is the recommended write operation for this use case. In addition, we adjusted Hudi's parquet file size setting to match Delta Lake's default.
CREATE TABLE ...
USING HUDI
OPTIONS (
type = 'cow',
primaryKey = '...',
precombineField = '',
'hoodie.datasource.write.hive_style_partitioning' = 'true',
-- Disable Hudi’s record-level metadata for updates, incremental processing, etc
'hoodie.populate.meta.fields' = 'false',
-- Use “bulk-insert” write-operation instead of default “upsert”
'hoodie.sql.insert.mode' = 'non-strict',
'hoodie.sql.bulk.insert.enable' = 'true',
-- Perform bulk-insert w/o sorting or automatic file-sizing
'hoodie.bulkinsert.sort.mode' = 'NONE',
-- Increasing the file-size to match Delta’s setting
'hoodie.parquet.max.file.size' = '141557760',
'hoodie.parquet.block.size' = '141557760',
'hoodie.parquet.compression.codec' = 'snappy',
-- All TPC-DS tables are actually relatively small and don't require the use of MT table (S3 file-listing is sufficient)
'hoodie.metadata.enable' = 'false',
'hoodie.parquet.writelegacyformat.enabled' = 'false'
)
LOCATION '...'
Hudi's origins [11] are rooted in incremental data processing, aiming to turn all old batch jobs into incremental ones [12]. Hudi's default configuration is therefore geared toward incremental upserts and toward generating change streams for incremental ETL pipelines, treating the initial load as a rare, one-off operation. As a result, extra care is needed around load times to make the comparison with Delta fair.
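For contrast with the bulk-insert configuration above, a minimal sketch of the incremental workload that Hudi's defaults target might look like the MERGE INTO below (table and key names are illustrative and are not part of the benchmark):
-- Hypothetical incremental upsert: fold a batch of changed rows into an existing
-- Hudi table by key; this is the kind of workload the default upsert path is tuned for.
MERGE INTO store_sales AS t
USING store_sales_changes AS s
  ON t.ss_item_sk = s.ss_item_sk
 AND t.ss_ticket_number = s.ss_ticket_number
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;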
4. Running the benchmark
4.1 Loading
As you can clearly see, Delta and Hudi are within 6% of each other in the 0.11.1 release, and within 5% on Hudi's current master (we also benchmarked Hudi's master branch, because we found an issue in the Parquet encoding configuration [13] and fixed it promptly). This is in spite of the rich feature set that Hudi provides on top of the plain Parquet table, for example:
• Incremental processing [14] (what changed since the commit at timestamp t)
• Record-level index [15] (supporting record-level lookups, updates, and deletes)
Moreover, Hudi internally stores a set of additional metadata with every record, known as meta fields [16]. Since TPC-DS focuses mainly on snapshot queries, these fields were disabled in this particular experiment (and not computed); Hudi still persists them as nulls, so they can be turned on in the future without schema evolution. Adding five such fields as nulls carries a very low cost, but it is still not negligible.
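For reference, when meta fields are enabled (the default), every record carries five extra columns maintained by Hudi; a quick way to inspect them is a query like the sketch below (the table name is illustrative):
-- The five Hudi meta fields, populated on write when enabled; in this benchmark they
-- were disabled, so they are stored as nulls.
SELECT _hoodie_commit_time,
       _hoodie_commit_seqno,
       _hoodie_record_key,
       _hoodie_partition_path,
       _hoodie_file_name,
       ss_item_sk
FROM   store_sales
LIMIT  10;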
4.2 Querying
As we can see, Hudi 0.11.1 and Delta 1.2.0 perform almost identically, and Hudi's current master is slightly faster (~5%). You can find the original logs in this Google Drive folder:
• Hudi 0.11: load [17] / query [18]
• Hudi master: load [19] / query [20]
• Delta 1.2.0: load [21] / query [22]
• Delta 2.0.0 rc1: load [23] / query [24]
To reproduce the above results, please use our branch of the Delta benchmark repository [25] and follow the steps in the readme.
5. Conclusion
To make a long story short, we want to emphasize the importance of openness and reproducibility in an area as sensitive and complex as performance benchmarking. As we have seen time and again, obtaining reliable and trustworthy benchmark results is tedious and challenging, and requires dedication, diligence, and rigor. Looking ahead, we plan to release more internal benchmarks highlighting how Hudi's rich feature set reaches unmatched performance levels on other common industry workloads. Stay tuned!
Environment setup
In this benchmark, we used Delta 1.0 and Iceberg 0.13.0; the environment configuration is listed in the table below.
As mentioned earlier, we used the TPC-DS benchmark open-sourced by Delta OSS [5] and extended it to support Iceberg. We recorded load performance, i.e. the time required to load the data from Parquet format into Delta/Iceberg tables, as well as query performance: each TPC-DS query was run three times, and the average running time was taken as the result.
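A minimal sketch of the load step in Spark SQL might look as follows, materializing one staged TPC-DS Parquet table into each format (catalog names and paths are illustrative, and the Iceberg statement assumes an Iceberg-enabled catalog is configured):
-- Load the staged Parquet data into a Delta table ...
CREATE TABLE delta_tpcds.store_sales
USING delta
AS SELECT * FROM parquet.`s3://bucket/tpcds/parquet/store_sales`;

-- ... and into an Iceberg table via a hypothetical Iceberg catalog.
CREATE TABLE iceberg_tpcds.tpcds.store_sales
USING iceberg
AS SELECT * FROM parquet.`s3://bucket/tpcds/parquet/store_sales`;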
Test results
1. Overall performance
After completing the benchmark, we found that Delta's overall performance was better for both load and query: it was 3.5x faster than Iceberg overall. Loading the data into Delta and executing the TPC-DS queries took 1.68 hours, whereas Iceberg needed 5.99 hours.
2. Load performance
When loading data from Parquet files into the two formats, Delta was 1.3x faster than Iceberg overall.
To analyze the load performance further, we dug into the detailed loading results for each table and noticed that the gap in loading time grows as tables get larger. For example, when loading the customer table, Delta and Iceberg perform essentially the same. On the other hand, when loading the store_sales table, one of the largest tables in the TPC-DS benchmark, Delta is 1.5x faster than Iceberg.
This shows that when loading data, Delta is both faster and more scalable than Iceberg.
3. Query performance
When executing the TPC-DS queries, Delta was 4.5x faster than Iceberg overall. Executing all queries on Delta took 1.14 hours, while executing the same queries on Iceberg took 5.27 hours.
Iceberg and Delta show essentially the same performance on query34, query41, query46, and query68; the difference on these queries is less than 1 second.
However, on the other TPC-DS queries, Delta is consistently faster than Iceberg, with the size of the gap varying.
In some queries, such as query72, Delta is 66x faster than Iceberg.
In the remaining queries, the difference between Delta and Iceberg ranges from 1.1x to 24x, always in Delta's favor.
Summary
After running the benchmark, Delta exceeded Iceberg in both scalability and performance, sometimes by a surprisingly large margin. This benchmark gives us and our customers a clear answer as to which solution to choose when building a data lakehouse.
It should also be noted that both Iceberg and Delta are improving constantly; as they do, we will continue to track their performance and share our results with the wider community.
If you want to analyze the benchmark results further and form your own opinion, you can download the complete benchmark report here [6].
Original article:
https://databeans-blogs.medium.com/delta-vs-iceberg-performance-as-a-decisive-criteria-add7bcdde03d