0
votes

I'm transforming data within different Databricks notebooks (reading, transforming and writing to/from ADLS). I connected these notebooks within a Data Factory pipeline:

Notebook 1 --> Notebook 2 --> Notebook 3 --> Notebook 4

I've then created a connection to my Databricks workspace from the Data Factory and added it to my notebook activities. I would like to start a Databricks cluster whenever the pipeline is triggered. Overall, all of this is working fine. But Databricks starts a job cluster for each notebook activity, which takes too long and seems unnecessary to me.

Is it possible to start a cluster at the beginning of a pipeline and then shut it down after all notebooks have been completed? Or are there any arguments for having a separate job cluster for each activity?


2 Answers

2
votes

Currently, using the same job cluster for multiple notebook activities is not possible.

Two alternative options:

  1. Use an interactive cluster.
  2. Use an interactive cluster and (if cost conscious) add a Web activity at the beginning of the pipeline to START the cluster via the Azure Databricks REST API, and another Web activity at the end, after the notebook activities, to DELETE (terminate) the cluster via the same API (see the sketch below this list).

Unfortunately, both options use interactive clusters, which are a bit more expensive than job clusters.
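A minimal sketch of the two REST calls those Web activities would issue, using the Databricks Clusters API (clusters/start and clusters/delete). The workspace URL, personal access token and cluster ID are placeholders you would supply yourself:

```python
import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"        # placeholder; keep this in a secret store
CLUSTER_ID = "<interactive-cluster-id>"  # placeholder

headers = {"Authorization": f"Bearer {TOKEN}"}

def start_cluster():
    # Starts an existing (terminated) interactive cluster.
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/clusters/start",
        headers=headers,
        json={"cluster_id": CLUSTER_ID},
    )
    resp.raise_for_status()

def terminate_cluster():
    # Terminates the cluster; it can be started again later.
    resp = requests.post(
        f"{DATABRICKS_HOST}/api/2.0/clusters/delete",
        headers=headers,
        json={"cluster_id": CLUSTER_ID},
    )
    resp.raise_for_status()
```

In ADF you would not run this Python yourself; the Web activities would POST the same URLs with the same JSON body and bearer token.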

0
votes

There is also a possible workaround. You can create and trigger a "master" Databricks notebook with a job cluster from ADF, and it will call your notebooks one by one, with the appropriate parameters, using the dbutils.notebook.run() command.

In this way, you keep the cost savings of a job cluster, and the cluster also terminates immediately after the run.
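A minimal sketch of such a master notebook, assuming hypothetical child notebook paths and a hypothetical "run_date" parameter passed in from ADF; replace these with your own notebooks and parameters:

```python
# Child notebooks to run sequentially on this notebook's job cluster (example paths).
notebooks = [
    "/pipeline/notebook_1",
    "/pipeline/notebook_2",
    "/pipeline/notebook_3",
]

# Example parameter handed over from the ADF notebook activity as a widget.
run_date = dbutils.widgets.get("run_date")

for path in notebooks:
    # timeout_seconds=0 means no timeout; the dict is passed to the child notebook's widgets.
    result = dbutils.notebook.run(path, 0, {"run_date": run_date})
    print(f"{path} finished with result: {result}")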

See this article for a worked example -> https://towardsdatascience.com/building-a-dynamic-data-pipeline-with-databricks-and-azure-data-factory-5460ce423df5