“With great power comes great responsibility” - Spider-Man
Introduction:
The world generates 2.5 quintillion bytes of data per day. That’s 2.5 exabytes, or 2,500 petabytes!
In line with the huge data volumes generated globally, features and techniques for harnessing them are becoming more and more crucial.
Extracting insights and driving business decisions from data requires querying it frequently.
One major factor is therefore query performance, whose enhancement has proved very beneficial both business-wise (faster business decisions) and cost-wise (less time with resources deployed).
For instance, one rewarding feature used to decrease query time is data skipping. However, it is strongly influenced by the data layout, and as you may have guessed, ordering huge volumes of data is neither cheap nor fast.
In their quest to make the most of such an expensive operation, data lakehouses that store their data in open formats, such as Delta Lake, adopted an innovative ordering technique called Z-ordering.
To sum up, it all comes down to an equation with two variables: the performance of queries and the cost of ordering.
And it’s up to you to “hit the bull’s eye” and answer one crucial question:
How to attain the best query performance with the minimum ordering charges?
Certainly, this is not an easy task.
This may seem a bit complicated. Don’t panic: from here on, we will use a simple business story to explain some fairly complex features and techniques.
Bob in the bookstore
Being a reading enthusiast, Bob, a data engineer looking for a side hustle, decided to invest a portion of his savings in buying a bookstore.
He was thrilled when he found one at a very low price, so he made an impulse buy thinking he was fortunate, but little did he know how many complications lay ahead.
It didn’t take Bob much time to start noticing problems.
Bob noticed that each client coming in needed to spend a significant amount of time searching for a specific book to rent. In some cases, people complained that they had to go through the entire store, shelf by shelf, book by book, before finding the book they wanted. Not to mention the cases in which their target didn’t even exist in the store. This made some customers furious, to the point that they stopped coming in.
Hence, something had to be done.
Therefore, reflecting on his days as a data engineer and his knowledge of Delta Lake, Bob thought of a technique called data skipping based on statistics.
Data skipping based on statistics:
Data skipping is an under-the-hood feature meant to avoid scanning irrelevant data.
Delta Lake stores a table’s data in multiple Parquet files, and uses file-level statistics (the minimum and maximum values of each column) to perform skipping at file granularity.
These min-max statistics are collected and persisted automatically in the Delta transaction log when data is written into a Delta Lake table.
Delta Lake takes advantage of this information at query time to avoid opening irrelevant files, thereby reducing query time.
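To make this concrete, here is a minimal sketch of the pruning idea, not Delta Lake’s actual implementation: it naively scans the JSON commit files of a table’s transaction log, reads the `stats` attached to each `add` action, and keeps only the files whose min-max range can contain the predicate value. It ignores `remove` actions and checkpoints, and the table path and column name are made up:

```python
import json
from pathlib import Path

def files_to_scan(delta_log_dir: str, column: str, value) -> list[str]:
    """Return the data files whose [min, max] range for `column` may
    contain `value`; every other file is skipped without being opened."""
    candidates = []
    for commit in sorted(Path(delta_log_dir).glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            add = json.loads(line).get("add")
            if not add or not add.get("stats"):
                continue
            stats = json.loads(add["stats"])  # stats are stored as a JSON string
            lo = stats["minValues"].get(column)
            hi = stats["maxValues"].get(column)
            if lo is not None and hi is not None and lo <= value <= hi:
                candidates.append(add["path"])
    return candidates

# e.g. the files that may contain books released in 2001
print(files_to_scan("/data/bookstore/books/_delta_log", "release_year", 2001))
```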
Correspondingly, Bob decided to label the shelves.
He extracted the highest and lowest values of the books’ release dates in each shelf, wrote them on some tags and taped them in place.
This way, clients would be able to skip entire shelves in their search just by reading the shelf’s label and deciding whether it contains their target.
The next day, Bob was eager to see results.
However, even after his remarkable effort, Bob was surprised to find no obvious improvement in the clients’ experience. People still had to spend hours navigating his store, and complaints kept pouring in like rain.
After further investigation, Bob realized that his failure was due to the fact that he hadn’t taken the books’ layout into consideration.
In fact, he noticed that the current layout didn’t favor any skipping: clients were still obliged to go through an enormous number of shelves when searching for any book.
For example, if someone were searching for a book released in 2001, they would be forced to go through both shelves in the figure above.
⇒ Skipping based on statistics is deeply influenced by data clustering
Moving on, Bob knew exactly what to do: he would sort the books by their release dates.
Relentlessly, he worked to take down all the books, sort them by their release dates, then put them back in their proper places.
And the result was beautiful:
The next morning, the flow of people navigating the store was clearly showing improvement, but still, something was off.
As he sat at the counter, Bob decided to dig deeper into this flow in order to understand the impact of his book sorting on the customers’ search time.
He decided to take notes on how much time each client spent from the moment they entered the store to the moment they stood in front of him, signing for their rented book.
The results clearly showed that a fair number of visitors did find their target in just a few minutes. But for most customers, it was still a long and painful journey, exactly as it had been before the sorting took place.
And just a few complaints in, Bob put his finger on the problem:
His sorting had only enhanced the experience of customers searching for books by their release dates. For those with different search criteria, such as the appropriate age, there was no improvement in the time spent looking for their target.
While the shelves were perfectly ordered based on release dates, there was no ordering of any kind based on the appropriate age.
With linear sorting, the first specified column is strongly favored, and its values are clustered perfectly. However, moving through the columns from left to right, clustering becomes less and less effective, which proportionally degrades any data skipping based on those columns.
Therefore, for linear sorting, the order in which the sort columns are stated matters, leading to an impossible choice that will negatively impact some clients’ experience every time. The toy example below makes this concrete.
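Here is a tiny self-contained sketch with a made-up catalogue of sixteen books: after a lexicographic sort on (release year, appropriate age) and a split into four “shelves” of four books each, the per-shelf min-max labels are narrow and disjoint for the release year, while the age labels tend to span the whole range:

```python
import random

random.seed(0)
# Made-up catalogue: (release_year, appropriate_age) pairs
books = [(random.randint(1990, 2009), random.choice([6, 12, 16, 18]))
         for _ in range(16)]

# Linear (lexicographic) sort: release_year first, appropriate_age second
books.sort()

# Four "shelves" (files) of four books each, labeled with min-max stats
for i in range(0, len(books), 4):
    shelf = books[i:i + 4]
    years = [year for year, _ in shelf]
    ages = [age for _, age in shelf]
    print(f"shelf {i // 4}: years [{min(years)}, {max(years)}]"
          f"  ages [{min(ages)}, {max(ages)}]")

# The year labels are narrow and disjoint (great skipping), while the
# age labels tend to cover [6, 18] on every shelf (no skipping at all).
```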
After hours spent contemplating his unfortunate situation, Bob’s Delta Lake knowledge kicked in once again. He remembered an ordering technique that could prove very efficient in this case, called Z-ordering.
Z-ordering is a technique used to colocate related information in the same set of files.
It maps multi-dimensional points to one-dimensional values in a way that preserves locality. This feature is available in Delta Lake 2.0.0 and above.
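To get an intuition for this mapping, here is a minimal sketch of the classic bit-interleaving (Morton code) construction behind the Z-order curve. It illustrates the idea only and is not Delta Lake’s actual implementation:

```python
def z_value(x: int, y: int, bits: int = 8) -> int:
    """Interleave the bits of x and y: nearby (x, y) points tend to get
    nearby z-values, which is how the curve preserves locality."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # even bits of z come from x
        z |= ((y >> i) & 1) << (2 * i + 1)  # odd bits of z come from y
    return z

# Sorting 2-D points by their single z-value keeps neighbors together,
# tracing the "Z" pattern the technique is named after
points = [(x, y) for x in range(4) for y in range(4)]
print(sorted(points, key=lambda p: z_value(*p)))
```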
Relying on these foundations, Bob concluded that by using the Z-order technique, he would be able to colocate books on shelves based on multiple criteria, and most importantly, the order in which he listed those criteria wouldn’t matter.
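In Delta Lake terms, Bob’s whole rearrangement boils down to a single command. A minimal sketch, assuming an active SparkSession `spark` with Delta Lake 2.0+ configured, and a hypothetical table path and column names:

```python
from delta.tables import DeltaTable

# Hypothetical location of Bob's table
table = DeltaTable.forPath(spark, "/data/bookstore/books")

# Rewrite the data files so that rows are clustered along the
# Z-order curve of both search criteria at once
table.optimize().executeZOrderBy("release_date", "appropriate_age")
```

The SQL equivalent is `OPTIMIZE books ZORDER BY (release_date, appropriate_age)`.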
Following this, Bob didn’t waste any more time and started rearranging the store.
He had to take every book off the shelves again, reorder them, then place them back.
It really was long and exhausting labor.
Consequently, he ended up with a satisfying result:
However, while neither criterion ends up perfectly ordered, both release dates and appropriate ages now present statistics with narrow ranges and no serious overlap.
Going back to his notes, Bob decided to pick a number of previous complaints and study the impact of his new ordering compared with his old one.
To better visualize the skipping effectiveness, he would color skipped shelves red and accessed shelves green.
- The first complaint in the box was from a customer looking for a horror book issued in the year 1992.
  - With sorting, 75% of the shelves are skipped and only 25% need to be accessed in order to find the target.
  - With Z-ordering, 50% of the shelves are skipped and 50% are accessed.
- The second complaint was about books for adults (+18).
  - With sorting, all shelves need to be accessed in search of the book (0% skipping).
  - With Z-ordering, 50% of the shelves are skipped and 50% are accessed.
As clearly illustrated, when searching for a book by the same criterion the shelves are sorted by, sorting provides perfect skipping by grouping all potential targets in just one shelf.
But it loses all skipping power when searching by any criterion other than the one the books are sorted by (0% skipping).
As for Z-ordering, regardless of the search criterion, skipping stayed constant at 50% in both cases, meaning clients only need to go through half the shelves to find their target book. Most importantly, no client will ever need to go through the pain of navigating the entire store.
Undoubtedly, Bob was ecstatic with the results to the point he forgot all about the pain in his back and his sleep deprivation caused by the huge amount of work he had to do.
As expected, as soon as the store opened its doors, confirmation came along: clients instantly noticed the huge impact of this new arrangement and complaints became compliments.
Three days went by, and for the first time, things were looking up for Bob. Business had never been better.
However, with each passing day, Bob started noticing that the time customers spent navigating the shelves was snowballing.
After a thorough inspection of the store, he found out that after borrowing a book, customers didn’t bother to return it to its exact original place.
Additionally, newly added books were hard to find because they did not abide by the ordering.
During inserts, deletes and updates, Delta uses skipping based on statistics to locate the input Parquet files that, according to their statistics, may be touched by the operation.
In a second phase, Delta reads the touched files and writes new files from scratch containing the changed rows. During this phase, any previous ordering of both the modified and unmodified rows in the touched files is lost.
Finally, the Delta transaction log is used to atomically remove the touched files and add the new ones.
To sum up: inserts, deletes and updates break Z-ordering, as the sketch below illustrates.
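In code, the breakage looks like this. A sketch reusing the same hypothetical table: the update rewrites, in full, every file whose statistics say it may contain a matching row, and the rewritten files come out with no Z-ordering:

```python
from delta.tables import DeltaTable

table = DeltaTable.forPath(spark, "/data/bookstore/books")

# Copy-on-write: every file that may hold a matching row is read and
# rewritten from scratch, and the new files are no longer Z-ordered
table.update(
    condition="appropriate_age = 18",
    set={"shelf": "'adults'"},  # hypothetical column and value
)

# The only remedy is to pay for the expensive operation all over again
table.optimize().executeZOrderBy("release_date", "appropriate_age")
```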
At this point, Bob knew that there was no escaping Z-ordering again.
But, knowing all too well the time and effort he’d have to put into this painful task, he was hell-bent on keeping its frequency as low as possible.
And this led him to wonder:
When is it advisable to Z-order again?
Each time he brings in new books to the store?
Maybe he should do it once per day, after closing the doors, in order to maintain his store’s well-being. Or could he go a whole week without truly impacting the customers’ experience? Even better, a whole month?
The longer Bob thought about this dilemma, the clearer it became that if he wanted to do it right, there were no easy direct answers.
As usual, when he hit a roadblock, Bob ran back to Delta for help. But after a thorough search and numerous calls to coworkers, he was shocked to find that Delta offers no techniques, features or classes to provide insights on how well data is clustered, or on when it becomes advisable to run an operation as expensive and time-consuming as Z-ordering.
Dead end
Stumbling on this issue, Bob decided to strike a deal with two customers who practically lived in the store reading books: he hired them as employees to maintain the well-being of the store, and in return they could read as many books as they wanted free of charge. Meanwhile, he decided it was time to step up and develop his own solution for Delta, as payback for all its previous help.
Z-ordering: take the Guesswork out (part 2): https://medium.com/@databeans-blogs/delta-z-ordering-take-the-guesswork-out-part2-1bdd03121aec