Step-by-step guide
Lighthouse is a library developed by DataBeans to optimize Lakehouse performance and reduce its total cost of ownership. It monitors the health of Lakehouse tables from a data layout perspective and reports how well their data is clustered.
Prerequisites:
- A Databricks account (the Community Edition is sufficient)
To set up your environment for extracting the clustering metrics of your Delta tables, follow these steps:
1- Install the Lighthouse Maven library on your cluster:
In your Databricks account, open the “Compute” page from the left sidebar, then click “Libraries” and “Install new” (Figure-1):
In the “Install library” window, select “Maven” as the library source, set the coordinates to “io.github.Databeans:lighthouse_2.12:0.1.0”, then click “Install” (Figure-2):
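The coordinates follow the usual Maven `group:artifact:version` layout. As an aside, and assuming you also want to compile code against the library outside Databricks (which this guide does not cover), an sbt build could declare the same dependency roughly like this:

```scala
// build.sbt fragment (assumption: an sbt project built with Scala 2.12).
// A single % is used because the artifact name already carries the _2.12 suffix.
libraryDependencies += "io.github.Databeans" % "lighthouse_2.12" % "0.1.0"
```

On Databricks itself this is not needed: the “Install library” dialog resolves the coordinates for you.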
The Lighthouse Maven library should now be installed on your cluster (Figure-3):
2- Import the Lighthouse Databricks notebook into your Databricks workspace:
Download the following notebook (Figure-4):
Import the downloaded notebook into your Databricks workspace (Figure-5):
Select the downloaded file “DeltaClusteringMetrics.scala” from your local machine then click “Import” (Figure-6):
Make sure to copy the file path of the uploaded notebook for future use (Figure-7):
Attach your notebook to the cluster on which the Lighthouse library is installed (Figure-8):
Create a new cell and use the “%run” magic command, with the path you copied earlier (Figure-7), to run the “DeltaClusteringMetrics.scala” notebook (Figure-9):
Lighthouse in action:
With these setup steps completed, our environment is ready. We can now use the Lighthouse library to extract the clustering metrics of any Delta table as follows:
1- Create a Delta table for testing purposes
Run the following in a notebook cell to create the table (Figure-10):
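import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

// Build a four-row DataFrame (ids 1 to 4) using the notebook's built-in `spark` session.
spark
  .range(1, 5, 1)
  .withColumn("id", col("id").cast(IntegerType))
  .withColumn("keys", lit(1))
  .withColumn("values", col("id") * 3)
  // Assumption: save as a managed Delta table named deltaTable, the name queried in the next steps.
  .write.format("delta").saveAsTable("deltaTable")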
Let’s take a look at our newly created Delta table (Figure-11):
select * from deltaTable
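The rows this query should return can be worked out by hand from the table definition. The sketch below mirrors that definition with plain Scala collections, so it needs no Spark session; `TableRow` and `expected` are illustrative names and not part of Spark or Lighthouse.

```scala
// Illustrative only: mirrors the deltaTable definition with plain Scala collections.
final case class TableRow(id: Int, keys: Int, values: Int)

// spark.range(1, 5, 1) is end-exclusive, so it yields ids 1, 2, 3 and 4.
val expected: Seq[TableRow] =
  (1 until 5).map(id => TableRow(id, keys = 1, values = id * 3))

expected.foreach(println)
// TableRow(1,1,3)
// TableRow(2,1,6)
// TableRow(3,1,9)
// TableRow(4,1,12)
```

If the query shows anything other than these four rows, the table was probably created from a different definition or already existed.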
2- Extract the clustering metrics of our Delta table
Let’s extract the clustering metrics for the “id” column of our Delta table (Figure-12):
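val clusteringMetric = DeltaClusteringMetrics
  .forName("deltaTable", spark)
  // Assumed final call (not confirmed by this guide): computeForColumn, the per-column method described in the Lighthouse README.
  .computeForColumn("id")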
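To build some intuition for what a clustering metric can capture, here is a toy sketch in plain Scala. It is not Lighthouse's implementation: `FileStats`, `overlaps` and `averageOverlap` are hypothetical names, and the definition of overlap used here is a simplifying assumption. The idea is that each data file stores a range of “id” values, and the more those ranges intersect, the less well the data is clustered.

```scala
// Hypothetical per-file statistics: the min and max "id" stored in one data file.
final case class FileStats(name: String, min: Int, max: Int)

// Two files overlap when their [min, max] ranges intersect.
def overlaps(a: FileStats, b: FileStats): Boolean =
  a.min <= b.max && b.min <= a.max

// Average number of other files that each file overlaps with.
// 0.0 means every file covers a disjoint range of values.
def averageOverlap(files: Seq[FileStats]): Double =
  if (files.isEmpty) 0.0
  else {
    val perFile = files.indices.map { i =>
      files.indices.count(j => j != i && overlaps(files(i), files(j)))
    }
    perFile.sum.toDouble / files.size
  }

val clustered = Seq(FileStats("f1", 1, 2), FileStats("f2", 3, 4))
val scattered = Seq(FileStats("f1", 1, 4), FileStats("f2", 2, 3), FileStats("f3", 3, 4))

println(averageOverlap(clustered)) // 0.0
println(averageOverlap(scattered)) // 2.0
```

A lower value means a filter on “id” can skip more files, which is the sense in which a table is well clustered on that column.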
Finally, if you are interested, here are some useful links:
- Lighthouse introduction:
- Notebook hosting the code for the previous example:
- Lighthouse Github repository:
- Lighthouse Maven repository:
- Lighthouse use-case Demo:
- Databeans on LinkedIn: