0
votes

I'm transforming data within different Databricks notebooks (reading, transforming and writing to/from ADLS). I connected these notebooks within a Data Factory pipeline:

Notebook 1 --> Notebook 2 --> Notebook 3 --> Notebook 4

I've then created a connection to my Databricks workspace from the Data Factory and added it to my notebook activities. I would like to start a Databricks cluster whenever the pipeline is triggered. Overall, all of this is working fine. But Databricks starts a job cluster for each notebook activity, which takes too long and seems unnecessary to me.

Is it possible to start a cluster at the beginning of a pipeline and then shut it down after all notebooks have been completed? Or are there any arguments for having a separate job cluster for each activity?


2 Answers

2
votes

Currently, using the same job cluster for multiple notebook activities is not possible.

Two alternative options:

  1. Use an interactive cluster.
  2. Use an interactive cluster and (if cost conscious) add a Web activity at the beginning of the pipeline to START the cluster via the Azure Databricks REST API, and another Web activity at the end, after the notebook activities, to DELETE (terminate) the cluster via the same API (see the sketch below this list).

Unfortunately, both options use interactive clusters, which are a bit more expensive than job clusters.
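A minimal sketch of the two REST calls those Web activities would issue, using the Databricks Clusters API (clusters/start and clusters/delete). The workspace URL, personal access token and cluster ID are placeholders you would supply yourself:

```python
import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"        # placeholder; keep this in a secret store
CLUSTER_ID = "<interactive-cluster-id>"  # placeholder

headers = {"Authorization": f"Bearer {TOKEN}"}

def start_cluster():
    # Starts an existing (terminated) interactive cluster.
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/clusters/start",
        headers=headers,
        json={"cluster_id": CLUSTER_ID},
    )
    resp.raise_for_status()

def terminate_cluster():
    # Terminates the cluster; it can be started again later.
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/clusters/delete",
        headers=headers,
        json={"cluster_id": CLUSTER_ID},
    )
    resp.raise_for_status()
```

In ADF you would not run this Python yourself; the Web activities would POST the same URLs with the same JSON body and bearer token.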

0
votes

There is also a possible workaround. You can create and trigger a "master" Databricks notebook with a job cluster from ADF, and it will call your notebooks one by one, with the appropriate parameters, using the dbutils.notebook.run() command.

In this way, you keep the cost savings of a job cluster, and the cluster also terminates immediately after the run.
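A minimal sketch of such a master notebook, assuming hypothetical child notebook paths and a hypothetical "run_date" parameter passed in from ADF; replace these with your own notebooks and parameters:

```python
# Child notebooks to run sequentially on this notebook's job cluster (example paths).
notebooks = [
    "/pipeline/notebook_1",
    "/pipeline/notebook_2",
    "/pipeline/notebook_3",
]

# Example parameter handed over from the ADF notebook activity as a widget.
run_date = dbutils.widgets.get("run_date")

for path in notebooks:
    # timeout_seconds=0 means no timeout; the dict is passed to the child notebook's widgets.
    result = dbutils.notebook.run(path, 0, {"run_date": run_date})
    print(f"{path} finished with result: {result}")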

See this article for a worked example -> https://towardsdatascience.com/building-a-dynamic-data-pipeline-with-databricks-and-azure-data-factory-5460ce423df5