Lighthouse setup on Databricks

DataBeans
3 min read · Jun 21, 2023


A step-by-step guide

Introduction:

Lighthouse is a library developed by DataBeans to optimize Lakehouse performance and reduce its total cost of ownership. It is designed to monitor the health of Lakehouse tables from a data layout perspective and provide valuable insights into how well data is clustered.

Prerequisites:

  • A Databricks account (the Community Edition will do just fine)

Setup:

To set up your environment to extract the clustering metrics of your Delta tables, follow these steps:

1- Install the Lighthouse Maven library on your cluster:

In your Databricks account, navigate to the “Compute” page in the left sidebar, then click on “Libraries” and “Install new” (Figure-1):

Figure-1

In the “Install library” window, select “Maven” as the library source, set the coordinates to “io.github.Databeans:lighthouse_2.12:0.1.0”, then click “Install” (Figure-2):

Figure-2
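As a side note, if you would rather experiment with Lighthouse in a local Spark project instead of through the cluster UI, the same artifact can be declared as an sbt dependency. Here is a minimal sketch, assuming a Scala 2.12 / Spark 3.x project (the Spark version below is only an example):

// build.sbt (sketch): same Lighthouse coordinates as in Figure-2
scalaVersion := "2.12.15"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.3.2" % "provided", // example Spark version
  "io.github.Databeans" % "lighthouse_2.12" % "0.1.0"
)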

The Lighthouse Maven library should now be installed on your cluster (Figure-3):

Figure-3

2- Import the Lighthouse Databricks notebook into your Databricks workspace:

Download the following notebook (Figure-4):

Figure-4

Import the downloaded notebook into your Databricks workspace (Figure-5):

Figure-5

Select the downloaded file “DeltaClusteringMetrics.scala” from your local machine, then click “Import” (Figure-6):

Figure-6

Make sure to copy the path of the imported notebook; you will need it in a later step (Figure-7):

Figure-7

Attach your notebook to the cluster on which the Lighthouse library is installed (Figure-8):

Figure-8

Create a new cell and use the “%run” magic command to run the “DeltaClusteringMetrics.scala” notebook, using the notebook path copied in the previous step (Figure-7), as shown below (Figure-9):

Figure-9
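For reference, the cell will look something like the following sketch; the workspace path here is only a placeholder and must be replaced by the path you copied in Figure-7:

%run /Users/<your-username>/DeltaClusteringMetrics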

Lighthouse in action:

With these setup steps completed, our environment is ready. We can now use the Lighthouse library to extract the clustering metrics of any Delta table as follows:

1- Create a Delta table for testing purposes

Create a small Delta table to experiment with (Figure-10):

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.IntegerType

spark
  .range(1, 5, 1)
  .toDF()
  .withColumn("id", col("id").cast(IntegerType))
  .withColumn("keys", lit(1))
  .withColumn("values", col("id") * 3)
  .write.mode("overwrite")
  .format("delta")
  .saveAsTable("deltaTable")

Figure-10

Let’s visualize the newly created Delta table (Figure-11):

%sql
select * from deltaTable

Figure-11
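If everything worked, the query should return the four rows generated above: id running from 1 to 4, keys always equal to 1, and values equal to 3, 6, 9, and 12.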

2- Extract the clustering metrics of our Delta table

Let’s extract the clustering metrics for the “id” column of our Delta table (Figure-12):

val clusteringMetric = DeltaClusteringMetrics
  .forName("deltaTable", spark)
  .computeForColumn("id")

display(clusteringMetric)

Figure-12
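Since clusteringMetric is a regular Spark DataFrame, you can do more than display it. For example, here is a minimal sketch (the metrics_history table name and the computed_at column are illustrative choices, not part of Lighthouse) that appends each run to a Delta table so clustering can be tracked over time:

import org.apache.spark.sql.functions.current_timestamp

// Append a timestamped snapshot of the metrics to a (hypothetical) tracking table
clusteringMetric
  .withColumn("computed_at", current_timestamp())
  .write
  .format("delta")
  .mode("append")
  .saveAsTable("metrics_history")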

Finally, if you are interested, here are some useful links:
