Introduction:
After we compared Delta and Iceberg in our previous blog post, many readers asked us to benchmark the latest versions and to add Apache Hudi to the mix.
So, by popular demand, we did exactly that: we ran TPC-DS on Delta 1.2.0, Iceberg 0.13.1 and Hudi 0.11.1 using Apache Spark 3.2.0.
What is TPC-DS?
TPC-DS is a data warehousing benchmark defined by the Transaction Processing Performance Council (TPC). TPC is a non-profit organization founded by the database community in the late 1980s with the goal of developing benchmarks that may be used objectively to test database system performance by simulating real-world scenarios. TPC has had a significant impact on the database industry.
“Decision support” is what the “DS” in TPC-DS stands for. There are 99 queries in total, ranging from simple aggregations to advanced pattern analysis.
Environment setup:
In this benchmark we used Hudi 0.11.1 with the copy-on-write (COW) table type, Delta 1.2.0 and Iceberg 0.13.1, with the environment components listed in the table below:
How did we do it?
As discussed earlier, we used the open-source TPC-DS benchmark from Delta OSS and extended it to support Iceberg and Hudi. We first recorded load performance, which is the time it takes to load data from Parquet format into Delta, Iceberg and Hudi tables. We then recorded query performance: each TPC-DS query was run three times, and we used the average running time.
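The "run three times and average" step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code of the benchmark we used: the names `time_runs` and `benchmark_queries` are hypothetical, and each query is assumed to be wrapped in a zero-argument callable.

```python
import time
from statistics import mean
from typing import Callable, Dict


def time_runs(
    run: Callable[[], None],
    repeats: int = 3,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Call `run` `repeats` times and return the mean wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = clock()
        run()
        durations.append(clock() - start)
    return mean(durations)


def benchmark_queries(
    queries: Dict[str, Callable[[], None]], repeats: int = 3
) -> Dict[str, float]:
    """Return the average duration per query name (hypothetical helper)."""
    return {name: time_runs(run, repeats) for name, run in queries.items()}


# Stand-in workload so the sketch runs without a cluster. With Spark, the
# callable could be something like: lambda: spark.sql(query_text).collect()
demo_queries = {"demo_query": lambda: sum(range(100_000))}
averages = benchmark_queries(demo_queries)
```

The clock is passed in as a parameter only so that the averaging logic can be checked with a fake clock; by default it uses `time.perf_counter`.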
Benchmark results:
1. Overall performance
Across loading and querying combined, Delta came out ahead: it was 1.7X faster than Iceberg and 4.3X faster than Hudi. Loading the data and running the TPC-DS queries took 1.75 hours on Delta, 2.97 hours on Iceberg and 7.65 hours on Hudi.[chart-1]
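For readers who want to reproduce the "NX faster" figures, here is a minimal sketch of the arithmetic, assuming the factor is simply the other format's total time divided by Delta's. The `speedup` helper is a hypothetical name; the hours are the rounded totals quoted above.

```python
def speedup(baseline_hours: float, other_hours: float) -> float:
    """How many times faster the baseline is than the other format."""
    return other_hours / baseline_hours


# Rounded end-to-end times (load + queries) in hours, as quoted in this post.
total_hours = {"Delta": 1.75, "Iceberg": 2.97, "Hudi": 7.65}

ratios = {
    name: speedup(total_hours["Delta"], hours)
    for name, hours in total_hours.items()
}

for name, ratio in ratios.items():
    print(f"Delta vs {name}: {ratio:.2f}X")
```

With these rounded inputs the ratios come out at roughly 1.70 for Iceberg and 4.37 for Hudi, in line with the 1.7X and 4.3X reported; the small gap is presumably due to rounding of the published hours.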
2. Load performance
When loading data from Parquet into each format, Delta was 1.17X faster than Iceberg and 9.7X faster than Hudi overall.[chart-2]
To further analyse the load performance results, we looked at the detailed load results of each table and noticed that the gap to Hudi widens as the table size grows. For example, when loading the customer table, Delta and Iceberg had practically the same performance and were 2.8X faster than Hudi. Meanwhile, for the store_sales table, which is one of the biggest tables in the TPC-DS benchmark, Delta was 1.2X faster than Iceberg and 11.5X faster than Hudi.
This also shows that, since the previous benchmark, the difference in load performance between Delta and Iceberg has shrunk by 0.3X.[chart-3]
3. Query performance
When running the TPC-DS queries, Delta was 1.39X faster than Hudi and 1.99X faster than Iceberg overall. Running all queries took 1.12 hours on Delta, 1.5 hours on Hudi and 2.23 hours on Iceberg.[chart-4]
To further analyse the query performance results, we ranked the formats by the number of queries in which each one was the fastest. Delta was the fastest in 68 of the 99 TPC-DS queries and Hudi was the fastest in 31 of them, while Iceberg was never the fastest in any query.
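The ranking by number of queries won can be sketched as follows. This is a hedged illustration only: `count_wins` is a hypothetical helper, and the timings in `sample` are made-up numbers, not values from our benchmark reports.

```python
from collections import Counter
from typing import Dict


def count_wins(timings: Dict[str, Dict[str, float]]) -> Counter:
    """Count, per format, the queries in which it had the lowest average time.

    `timings` maps query name -> {format name -> average seconds}.
    On an exact tie, the format listed first for that query gets the win.
    """
    wins: Counter = Counter()
    for per_format in timings.values():
        fastest = min(per_format, key=per_format.get)
        wins[fastest] += 1
    return wins


# Made-up timings for illustration only (not benchmark data).
sample = {
    "query_a": {"Delta": 10.0, "Iceberg": 18.0, "Hudi": 12.0},
    "query_b": {"Delta": 4.0, "Iceberg": 6.0, "Hudi": 3.0},
    "query_c": {"Delta": 50.0, "Iceberg": 90.0, "Hudi": 75.0},
}
wins = count_wins(sample)
```

A win count like this says nothing about the size of the margin, which is why the per-query differences are discussed separately.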
As for the performance difference, Iceberg, Delta and Hudi delivered approximately the same performance in query1, query15, query20 and query33.
We also noticed that when Hudi outperformed Delta, it was exclusively in queries with a short execution time (less than 30 seconds), where Hudi was typically 1.1X to 4.3X faster. However, Delta was faster in all time-demanding queries, by margins that ranged from 1.5X (in query 23, which is the most time-demanding query) to 2.6X (in query 88).[chart-5]
Conclusion:
In this benchmark, Delta outperformed Iceberg and Hudi in both loading and querying the data. The out-of-the-box load performance of Hudi was unexpectedly slow and significantly impacted its overall benchmark results.
It is also worth noting that Iceberg was 2X faster than in our previous benchmark. Despite this improvement, it remained slower than Delta, and it was slower than Hudi in query performance.
To further analyse and extract your own insights from this benchmark, you can download the full benchmark reports here.