Step-by-step guide
Introduction:
Lighthouse is a library developed by DataBeans to optimize Lakehouse performance and reduce its total cost of ownership. It is designed to monitor the health of Lakehouse tables from a data-layout perspective and provide valuable insights into how well data is clustered.
Prerequisites:
- A Databricks account (the Community Edition will do just fine)
Setup:
To set up your environment to extract the clustering metrics of your Delta tables, follow these steps:
1- Install the Lighthouse Maven library on your cluster:
In your Databricks account, open the “Compute” page from the left sidebar, select your cluster, then click “Libraries” and “Install new” (Figure-1):
In the “Install library” window, select “Maven” as your library source, set the coordinates to “io.github.Databeans:lighthouse_2.12:0.1.0”, then click “Install” (Figure-2):
The Lighthouse Maven library should now be successfully installed on your cluster (Figure-3):
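Note: if you would rather declare the dependency in a build file than through the UI (for local development, for instance), the same Maven coordinates translate to a single sbt line. A minimal sketch, assuming Scala 2.12 to match the lighthouse_2.12 artifact:

// build.sbt: same Maven coordinates as the UI install above
libraryDependencies += "io.github.Databeans" % "lighthouse_2.12" % "0.1.0"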
2- Import the Lighthouse Databricks notebook into your Databricks workspace:
Download the following notebook (Figure-4):
Import the downloaded notebook into your Databricks workspace (Figure-5):
Select the downloaded file “DeltaClusteringMetrics.scala” from your local machine, then click “Import” (Figure-6):
Make sure to copy the file path of the uploaded notebook for future use (Figure-7):
In your notebook, attach the cluster on which the Lighthouse library is installed (Figure-8).
Create a new cell and use the “%run” magic command to run the “DeltaClusteringMetrics.scala” notebook (Figure-9), using the notebook path you copied earlier (Figure-7):
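For instance, if the notebook was imported under your user folder, the cell would look something like this (the path below is only a placeholder; use the exact path copied in Figure-7):

%run /Users/<your-user>/DeltaClusteringMetrics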
Lighthouse in action:
With these simple setup steps completed, our environment is ready. We can now use the Lighthouse library to extract the clustering metrics of any Delta table as follows:
1- Create a Delta table for testing purposes
Run the following snippet to create a small test table (Figure-10):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

// Create a 4-row test table: "id" in 1..4 (the upper bound is exclusive),
// a constant "keys" column, and "values" = id * 3
spark
  .range(1, 5, 1)
  .toDF()
  .withColumn("id", col("id").cast(IntegerType))
  .withColumn("keys", lit(1))
  .withColumn("values", col("id") * 3)
  .write.mode("overwrite")
  .format("delta")
  .saveAsTable("deltaTable")
Let’s visualize our newly created Delta table (Figure-11):
%sql
SELECT * FROM deltaTable
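Since the table is built deterministically (ids 1 through 4, a constant key, and values = id * 3), the query should return these four rows:

id | keys | values
 1 |    1 |      3
 2 |    1 |      6
 3 |    1 |      9
 4 |    1 |     12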
2- Extract the clustering metrics of our Delta table
Let’s extract the clustering metrics for the column “id” of our Delta table (Figure-12):
// Compute the clustering metrics for the "id" column of "deltaTable"
val clusteringMetric = DeltaClusteringMetrics
  .forName("deltaTable", spark)
  .computeForColumn("id")

display(clusteringMetric)
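To see how these metrics react to a change in data layout, you can reorganize the table and compute them again. The sketch below uses Databricks’ OPTIMIZE ... ZORDER BY command (availability depends on your runtime); the metric calls are the same ones shown above:

// Z-order the table on "id" to change its physical data layout
spark.sql("OPTIMIZE deltaTable ZORDER BY (id)")

// Recompute the clustering metrics to compare against the previous run
val metricsAfterZorder = DeltaClusteringMetrics
  .forName("deltaTable", spark)
  .computeForColumn("id")

display(metricsAfterZorder)

On this tiny demo table the effect will be trivial, but the before/after pattern carries over to real tables.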
Finally, if you want to dig deeper, here are some useful links:
- Lighthouse introduction: https://www.linkedin.com/feed/update/urn:li:activity:7074148921526099968
- Notebook with the code from the example above: https://github.com/Databeans/lighthouse_databricks_demo
- Lighthouse Github repository: https://github.com/Databeans/lighthouse
- Lighthouse Maven repository: https://mvnrepository.com/artifact/io.github.Databeans/lighthouse
- Lighthouse use-case Demo: https://medium.com/@databeans-blogs/delta-z-ordering-take-the-guesswork-out-part2-1bdd03121aec
- Databeans on LinkedIn: https://www.linkedin.com/company/databeans/