Introduction
Last time, we left Bob in a very complicated situation: he was facing too many problems with too little time to solve them (Part-1). Bob decided to take a different approach to overcome those challenges.
In order to work smarter, Bob needs to go through all the challenges in the pipeline and check which solution, if any, would provide what he needs.
He must tackle his challenges in a new and more efficient way, preferably without making major changes to his current architecture.
Bob started asking his senior data engineer colleagues about similar challenges they had faced and searched online for possible solutions, and one name kept popping up: Delta Lake. He focused on this solution to check whether it really solves the challenges he faces on a regular basis.
Delta Lake as a Potential Solution
Delta Lake is an open-source storage layer that brings data reliability to data lakes. It provides ACID transactions. It runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
Challenge 1: Reproducibility
One of the first challenges that Bob faced was the ability to create reproducible pipelines.
Since computations can fail when running on distributed systems, Bob used to rerun jobs and clean up partial writes; thanks to Delta Lake's ACID guarantees, failed writes leave no partial data behind, so he is freed from both chores.
Moreover, when business logic changes, Bob has a new trick up his sleeve: DML support. He can now efficiently UPSERT or DELETE data in Delta tables.
Delta Lake simplifies data pipelines with flexible UPSERT support: a single MERGE statement can carry multiple matched and not-matched clauses, each with its own action. UPDATE and DELETE actions can be used in matched clauses, while only INSERT can be used in not-matched clauses.
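As a rough sketch of what such an UPSERT could look like with the delta-spark Python API (the table path, the event_id column, and the updates_df DataFrame are hypothetical, and a Delta-enabled SparkSession named spark is assumed):

    from delta.tables import DeltaTable

    # Hypothetical target table and a DataFrame of new/changed rows
    events = DeltaTable.forPath(spark, "/data/delta/events")

    (events.alias("t")
        .merge(updates_df.alias("u"), "t.event_id = u.event_id")
        .whenMatchedDelete(condition="u.is_deleted = true")  # DELETE some matched rows
        .whenMatchedUpdateAll()                               # UPDATE the remaining matches
        .whenNotMatchedInsertAll()                            # INSERT rows with no match
        .execute())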
After the small triumph that DML support and ACID guarantees brought our engineer, Bob was struck by a new challenge: recovering from a seriously faulty insert. To solve this problem with minimal overhead, Bob used the RESTORE feature that Delta Lake provides, which lets him access and revert to earlier states of the table.
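A minimal sketch of such a rollback, assuming the delta-spark Python API (where RESTORE is exposed from Delta Lake 1.2 onwards) and a hypothetical table path and version number:

    from delta.tables import DeltaTable

    events = DeltaTable.forPath(spark, "/data/delta/events")

    # Inspect the table history to find the last good version
    events.history().select("version", "timestamp", "operation").show()

    # Revert the table to the state it had at version 42 (hypothetical)
    events.restoreToVersion(42)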
Data scientists on Bob's team needed to compare different machine learning models. For an apples-to-apples comparison, the same data needs to be used to build these different models. However, the data keeps changing as ingestion is a continuous process, so the time travel feature was Bob's best choice to deliver the same version of the data to train all the models.
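Pinning one version for every training run could look like this (the path and version number are placeholders):

    # Every model is trained on the exact same snapshot of the data
    training_df = (spark.read.format("delta")
                   .option("versionAsOf", 12)   # or .option("timestampAsOf", "2021-06-01")
                   .load("/data/delta/training_set"))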
Challenge 2: Concurrency
While Bob was roaming the data engineer's natural habitat, i.e. Stack Overflow, he found a user claiming that exceptions were thrown from time to time when concurrently updating Delta tables.
Bob checked the type of exceptions that were thrown and found out that they were actually due to optimistic concurrency control, which provides transactional guarantees between writes without impacting performance. Under this mechanism, writes operate in three stages:
1. Read: it reads (if needed) the latest available version of the table to identify which files need to be modified.
2. Write: it stages all the changes by writing new data files that are ready to be committed.
3. Validate and commit: before committing the changes, it checks whether the proposed changes conflict with any other changes that may have been concurrently committed since the snapshot that was read. If there are no conflicts, all the staged changes are committed as a new versioned snapshot, and the write operation succeeds. However, if there are conflicts, the transaction fails with a concurrency exception rather than corrupting the table as would happen with Parquet or ORC.
So it turns out that Bob saved the day, again, by explaining the limitations of Delta Lake's optimistic concurrency, which are primarily safeguards to avoid corrupting the tables. Now, data engineers only need to catch concurrency exceptions and retry the related jobs until they succeed. [Figure-1]
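A minimal retry sketch, assuming the delta-spark Python package (which exposes these concurrency exceptions) and a hypothetical write_batch function that performs the actual Delta write:

    import time
    from delta.exceptions import ConcurrentAppendException, ConcurrentDeleteReadException

    def write_with_retry(write_batch, max_retries=5):
        """Retry a Delta write that loses an optimistic-concurrency conflict."""
        for attempt in range(max_retries):
            try:
                write_batch()   # hypothetical function performing the Delta write
                return
            except (ConcurrentAppendException, ConcurrentDeleteReadException):
                # Another writer committed first: back off, then retry on a fresh snapshot
                time.sleep(2 ** attempt)
        raise RuntimeError(f"Write still conflicting after {max_retries} retries")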
Challenge 3: Many small files
Now that Bob's team was becoming more at ease with Delta Lake, the many-small-files challenge was tackled in the hope of finding a simpler solution for it.
To solve this problem, Bob can compact a table with the OPTIMIZE command, which rewrites the many small files of a Delta table into fewer, larger ones. He can also restrict the compaction to the partitions he cares about.
Delta Lake handles the compaction in an optimized way, so Bob doesn't have to specify the target number of files after compaction. [Figure-2]
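Assuming Delta Lake 2.0+, where OPTIMIZE is available in the open-source Python API, and a hypothetical partitioned table, the compaction could look like this:

    from delta.tables import DeltaTable

    events = DeltaTable.forPath(spark, "/data/delta/events")

    # Compact only the partition Bob cares about; Delta picks the output file sizes
    (events.optimize()
        .where("event_date = '2021-06-01'")
        .executeCompaction())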
As a result of the compaction process, we get larger files optimized for analytics and better read performance: compaction reduces the overhead of opening and closing many small data files and makes metadata operations faster.
Challenge 4: Data Quality
The ingestion team contacted Bob to inform him that the data source had been upgraded and enriched with more fields, resulting in a schema change, and that the new data needed to be processed within the same Delta Lake pipeline. Bob instantly knew that the solution was to enable schema evolution, which is activated by adding .option("mergeSchema", "true") to your .write or .writeStream Spark command.
By default, Delta offers schema enforcement, also known as schema validation: a safeguard that ensures data quality by rejecting writes to a table that do not match the table's schema. This is called a schema-on-write data-handling strategy. With this strategy, any write with a mismatched schema is rejected with an error instead of silently corrupting the table.
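A sketch of an append with schema evolution enabled (the path and the enriched_df DataFrame are placeholders); without the mergeSchema option, schema enforcement would reject this same write:

    # enriched_df contains the new columns added by the upgraded source
    (enriched_df.write.format("delta")
        .mode("append")
        .option("mergeSchema", "true")   # allow the table schema to evolve
        .save("/data/delta/events"))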
Challenge 5: GDPR: right to be forgotten
To comply with data privacy regulations such as GDPR, Bob needs to be able to delete data about a specific user on demand. He must have an efficient way to delete or correct data about individual users.
With Delta Lake's DML support, Bob can reliably change or erase information. He can easily modify records with transactional deletes and upserts, ensuring that the right to be forgotten is granted. [Figure-3]
It is also worth mentioning that the DELETE command removes the data from the latest version of the Delta table but does not remove it from the physical storage until the old versions are explicitly vacuumed.
To permanently delete data from a table, you can simply use the VACUUM command, which removes files that are older than the retention period.
Note that the ability to time travel back to a version older than the retention period is lost after running VACUUM.
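A minimal sketch of the two steps with the delta-spark Python API (the table path, user id, and retention period are hypothetical):

    from delta.tables import DeltaTable

    users = DeltaTable.forPath(spark, "/data/delta/user_events")

    # 1. Logically remove every record belonging to the user
    users.delete("user_id = 'e3b0c442'")

    # 2. Physically purge unreferenced files older than the retention period
    #    (the default retention is 7 days = 168 hours)
    users.vacuum(retentionHours=168)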
Migration cost
Delta Lake offers a solution to all of Bob's problems; now it's only a question of migration cost to convince his superiors to move to Delta. [Figure-4]
Migrating non-Delta Lake workloads to Delta Lake is a low-cost operation; it doesn't require any hardware modification, since Delta Lake runs on top of your existing data lake. Developers can use Delta Lake with their existing data pipelines with minimal changes, as it is fully compatible with Spark as both a batch and streaming source and sink.
Converting to Delta tables:
- For Parquet: convert the files in place to Delta Lake format and create a Delta table:
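For instance, with the delta-spark Python API (the path is a placeholder), an in-place conversion could look like this:

    from delta.tables import DeltaTable

    # Convert the existing Parquet files in place into a Delta table
    DeltaTable.convertToDelta(spark, "parquet.`/data/lake/events`")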
- For other formats (for example, CSV): read the data with Spark and write it out in Delta format:
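For example (paths are placeholders), a CSV dataset can simply be read with Spark and written out once as a Delta table:

    # Read the CSV source, then rewrite it in Delta format
    df = spark.read.option("header", "true").csv("/data/raw/events.csv")
    df.write.format("delta").save("/data/delta/events")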
Converting workloads:
When converting workloads from a data lake to Delta Lake, only minimal changes are needed. The code is very similar, which makes the changes easy to perform.
- Code with Parquet:
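A typical Parquet write and read (the df DataFrame and paths are placeholders):

    # Writing and reading with plain Parquet
    df.write.format("parquet").mode("append").save("/data/lake/events")
    events_df = spark.read.format("parquet").load("/data/lake/events")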
- Code with Delta Lake:
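The same pipeline on Delta Lake only swaps the format string (again with placeholder names):

    # Writing and reading with Delta Lake: only the format changes
    df.write.format("delta").mode("append").save("/data/delta/events")
    events_df = spark.read.format("delta").load("/data/delta/events")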
Conclusion
Migrating from non-Delta Lake workloads to Delta Lake seems to solve data lake limitations with ease; a few lines of code are enough in most cases. But Delta Lake doesn't come without its own challenges.
We will discuss some of them in upcoming blog posts, alongside new and interesting performance tuning projects.