Delta vs Iceberg: Performance as a decisive criterion

Introduction

A data lakehouse is an open data architecture that brings together the scalability and cost-effectiveness of data lakes with the reliability and performance of data warehouses on a single data platform.

Simply put, the data lakehouse is the only data architecture that allows you to store all types of data in your data lake (unstructured, semi-structured, and structured) while maintaining the data quality and governance standards of a data warehouse.

One of the key pillars of data lakehouses is the open format. The format in which data is stored is likely the most important decision to make when building a data lakehouse. It’s inspiring to think that merely changing the format in which data is stored might unlock new features and enhance overall system performance.

Unfortunately, the comparisons between Delta and Iceberg that we found to help make an informed decision are limited to features. That’s why we took the comparison to the performance level by simulating real-world scenarios using the TPC-DS benchmark.

What is TPC-DS?

TPC-DS is a data warehousing benchmark defined by the Transaction Processing Performance Council (TPC). TPC is a non-profit organization founded by the database community in the late 1980s with the goal of developing benchmarks that can be used to test database system performance objectively by simulating real-world scenarios. TPC has had a significant impact on the database industry.

“Decision support” is what the “DS” in TPC-DS stands for. There are 99 queries in total, ranging from simple aggregations to advanced pattern analysis.

Environment setup

In this benchmark we used Delta 1.0 and Iceberg 0.13.0 with the environment components listed in the table below:

We used the open-source TPC-DS benchmark published by Delta OSS and extended it to support Iceberg. We first recorded load performance, which is the time it takes to load data from Parquet format into Delta and Iceberg tables. We then recorded query performance: each TPC-DS query was run three times, and the average running time was used.
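The measurement procedure described above (run each query three times, keep the average) can be sketched as a small timing harness. This is a minimal illustration, not the benchmark's actual code: `average_runtime` and the `run_query` callable are hypothetical names, and in a real run the callable would submit a TPC-DS query to Spark.

```python
import time
from statistics import mean
from typing import Callable, List


def average_runtime(run_query: Callable[[], object], runs: int = 3) -> float:
    """Run `run_query` `runs` times and return the mean wall-clock time in seconds."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    durations: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()  # hypothetical: would execute one TPC-DS query
        durations.append(time.perf_counter() - start)
    return mean(durations)


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without a Spark cluster.
    elapsed = average_runtime(lambda: sum(range(100_000)))
    print(f"average over 3 runs: {elapsed:.6f} s")
```

Averaging over several runs smooths out one-off effects such as cold caches, which is presumably why a single run per query was not considered enough.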

Benchmark results

1. Overall performance

Across loading and querying combined, Delta was roughly 3.5X faster than Iceberg. It took 1.68 hours to load the data into Delta and run the TPC-DS queries, versus 5.99 hours to do the same with Iceberg.[chart-1]

Chart-1: Load and query performance

2. Load performance

When loading data from Parquet into the two target formats, Delta was 1.3X faster overall than Iceberg.[chart-2]

Chart-2: Load performance

To analyse the load results further, we looked at the load time of each table and noticed that the gap widens as table size grows. For example, when loading the customer table, Delta and Iceberg had practically the same performance, whereas for store_sales, one of the biggest tables in the TPC-DS benchmark, Delta was 1.5X faster than Iceberg.

This suggests that Delta is faster and scales better than Iceberg when loading data.[chart-3]

Chart-3: Detailed load performance

3. Query performance

When running the TPC-DS queries, Delta was roughly 4.5X faster overall than Iceberg. It took 1.14 hours to run all queries on Delta, versus 5.27 hours on Iceberg.[chart-4]
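The reported totals are enough to cross-check the ratios quoted in this article. The short script below uses only the four published figures (1.68 h and 5.99 h overall, 1.14 h and 5.27 h for queries) and derives load time as the difference between them, which is an assumption about how the totals decompose. The variable and function names are illustrative.

```python
# Published figures, in hours.
TOTAL = {"delta": 1.68, "iceberg": 5.99}  # load + query
QUERY = {"delta": 1.14, "iceberg": 5.27}  # query only


def speedup(times: dict) -> float:
    """How many times faster Delta is than Iceberg for the given timings."""
    return times["iceberg"] / times["delta"]


# Assumption: load time = total time - query time.
LOAD = {engine: TOTAL[engine] - QUERY[engine] for engine in TOTAL}

if __name__ == "__main__":
    print(f"overall speedup: {speedup(TOTAL):.2f}x")
    print(f"query speedup:   {speedup(QUERY):.2f}x")
    print(f"load speedup:    {speedup(LOAD):.2f}x")
```

The derived load ratio of about 1.33 agrees with the 1.3X reported for loading, and the overall and query ratios come out near 3.6 and 4.6, slightly above the rounded 3.5X and 4.5X quoted in the text.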

Chart-4: Query total performance

Iceberg and Delta delivered approximately the same performance on query34, query41, query46 and query68. The difference on those queries was less than 1 second.

However, Delta was faster than Iceberg on all the remaining TPC-DS queries, by varying margins.

On some queries, such as query72, Delta was 66X faster than Iceberg.

On the other queries, Delta was between about 1.1X and 24X faster than Iceberg.[chart-5]

Chart-5: Detailed query performance

Conclusion:

In this benchmark, Delta outperformed Iceberg in both scalability and performance, by unexpectedly wide margins. The results gave us and our customers a clear answer about which solution to opt for when building a lakehouse.

Both Iceberg and Delta will keep improving. As they do, we will keep an eye on their performance and share the results with the wider community.

To further analyse and extract your own insights from this benchmark, you can download the full benchmark reports here.

Simplify your data pipelines through simple reusable components [databeans.fr]

DataBeans