Lighthouse setup on Databricks

DataBeans
Jun 21, 2023

A step-by-step guide

Introduction:

Lighthouse is a library developed by DataBeans to optimize Lakehouse performance and reduce its total cost of ownership. It monitors the health of Lakehouse tables from a data layout perspective and provides insights into how well the data is clustered.

Prerequisites:

  • Databricks account (community edition would do just fine)

Setup:

To set up your environment so that you can extract the clustering metrics of your Delta tables, follow these steps:

1- Install the Lighthouse Maven library on your cluster:

In your Databricks workspace, open the “Compute” page from the left sidebar, select your cluster, then click “Libraries” and “Install new” (Figure-1):

Figure-1

In the “Install library” window, select “Maven” as the library source, set the coordinates to “io.github.Databeans:lighthouse_2.12:0.1.0”, then click “Install” (Figure-2):

Figure-2

The Lighthouse Maven library should now be installed on your cluster (Figure-3):

Figure-3

2- Import the Lighthouse Databricks notebook into your Databricks workspace:

Download the “DeltaClusteringMetrics.scala” notebook (Figure-4):

Figure-4

Import the downloaded notebook into your Databricks workspace (Figure-5):

Figure-5

Select the downloaded file “DeltaClusteringMetrics.scala” from your local machine, then click “Import” (Figure-6):

Figure-6

Copy the workspace path of the imported notebook; you will need it in a later step (Figure-7):

Figure-7

In your notebook, attach the cluster on which the Lighthouse library is installed (Figure-8):

Figure-8

Create a new cell and use the “%run” magic command to run the “DeltaClusteringMetrics.scala” notebook (Figure-9), using the path you copied in the previous step (Figure-7); a sketch of the cell is shown below:

Figure-9
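The cell looks something like the following; the workspace path here is only a placeholder, so replace it with the path you copied in Figure-7:

%run /Users/your.name@example.com/DeltaClusteringMetrics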

Lighthouse in action:

With these setup steps completed, our environment is ready. We can now use the Lighthouse library to extract the clustering metrics of any Delta table, as follows:

1- Create a Delta table for testing purposes

The following snippet creates a small Delta table to experiment with (Figure-10):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

// create a small Delta table with columns id, keys and values
spark
  .range(1, 5, 1)
  .toDF()
  .withColumn("id", col("id").cast(IntegerType))
  .withColumn("keys", lit(1))
  .withColumn("values", col("id") * 3)
  .write.mode("overwrite")
  .format("delta")
  .saveAsTable("deltaTable")

Figure-10

Let’s visualize our newly created Delta table (Figure-11):

%sql
select * from deltaTable

Figure-11
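Optionally, since clustering metrics tend to be more telling when a table spans more than one file, you can append a second batch of rows before computing them; this extra write is only an illustration and is not required for the next step:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

// append a second batch of rows so deltaTable is no longer a single-file table
spark
  .range(5, 10, 1)
  .toDF()
  .withColumn("id", col("id").cast(IntegerType))
  .withColumn("keys", lit(1))
  .withColumn("values", col("id") * 3)
  .write.mode("append")
  .format("delta")
  .saveAsTable("deltaTable")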

2- Extract the clustering metrics of our Delta table

Let’s extract the clustering metrics for the “id” column of our Delta table (Figure-12):

val clusteringMetric = DeltaClusteringMetrics
  .forName("deltaTable", spark)
  .computeForColumn("id")

display(clusteringMetric)

Figure-12
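Since display() renders it, clusteringMetric behaves like an ordinary Spark DataFrame; assuming that is the case, you can also print it in the notebook output or persist it, for example to track how a table’s layout evolves over time (the table name clustering_metrics_history below is just an illustrative choice):

// print the metrics without truncating column values
clusteringMetric.show(false)

// persist the metrics, e.g. to keep a history of data layout over time
clusteringMetric
  .write
  .mode("append")
  .format("delta")
  .saveAsTable("clustering_metrics_history")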

Finally, if you are interested, here are some useful links:


DataBeans: Simplify your data pipelines through simple reusable components [databeans.fr]