1 vote

I am parsing files from Azure Blob Storage using Spark in Azure Databricks. The blob container is mounted as DBFS. Right now I do this in a notebook with a hardcoded file name (the DBFS path). I want to trigger the notebook with the new DBFS path whenever a new blob is created. I checked that Azure Functions can give me a blob trigger. Can I start a Databricks notebook/job from Azure Functions? The operations on the blob take quite some time. Is it advisable to use Azure Functions in such cases, or is there some other way to achieve this?
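For context, this is roughly what I had in mind for the Functions route: the blob-triggered function would call the Databricks Jobs REST API (run-now) and pass the blob name as a notebook parameter. A rough sketch, assuming the notebook is wrapped in a Databricks job; the workspace URL, token, job ID and the `file_name` parameter name are all placeholders:

```python
import os
import requests  # assumed to be listed in the Function App's requirements

def start_databricks_job(blob_name: str) -> dict:
    """Called from the blob-triggered Azure Function to kick off the notebook run.

    Assumes the notebook is wrapped in a Databricks job; host, token, job id
    and the 'file_name' parameter name are placeholders.
    """
    host = os.environ["DATABRICKS_HOST"]      # e.g. https://adb-1234.azuredatabricks.net
    token = os.environ["DATABRICKS_TOKEN"]    # personal access token
    job_id = int(os.environ["DATABRICKS_JOB_ID"])

    resp = requests.post(
        f"{host}/api/2.1/jobs/run-now",
        headers={"Authorization": f"Bearer {token}"},
        json={
            "job_id": job_id,
            # The notebook would read this with dbutils.widgets.get("file_name")
            "notebook_params": {"file_name": f"/mnt/blobmount/{blob_name}"},
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()  # contains the run_id of the submitted run
```

As far as I can tell, run-now only submits the run and returns a run_id, so the function itself would not have to wait for the long-running notebook.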

Check Azure Data Factory. You can schedule a trigger whenever a new file is added to blob storage. ADF will pass this file name as a parameter to the Databricks notebook. You can check widgets in Databricks, which will get this file name so you can use it in the notebook. – Partha Deb
I found something called Databricks Streaming and I am investigating it. Does anyone have any thoughts about it? Can it be used as well? So far I have not been able to find whether I could execute my own function per file to parse; all the examples are based on CSV files. – saras
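To make the streaming idea in the last comment concrete: Structured Streaming can watch the mounted directory as a file source and hand each micro-batch to a custom function via foreachBatch, with input_file_name() recovering the source path of every file, so it is not limited to CSV. A rough sketch, assuming a /mnt/blobmount mount point and an illustrative parse_file function (both placeholders):

```python
from pyspark.sql import functions as F

# Placeholder paths; adjust to the actual mount point.
INPUT_DIR = "/mnt/blobmount/incoming"
CHECKPOINT_DIR = "/mnt/blobmount/_checkpoints/incoming"

def parse_file(path: str) -> None:
    """Stand-in for the custom per-file parsing logic."""
    print(f"parsing {path}")

def process_batch(batch_df, batch_id):
    # A micro-batch can contain lines from several newly arrived files;
    # recover the distinct source paths and run the parser on each one.
    rows = (batch_df
            .select(F.input_file_name().alias("path"))
            .distinct()
            .collect())
    for row in rows:
        parse_file(row["path"])

# 'spark' is the SparkSession that Databricks provides in every notebook.
query = (
    spark.readStream
         .text(INPUT_DIR)                       # watch the directory for new files
         .writeStream
         .foreachBatch(process_batch)
         .option("checkpointLocation", CHECKPOINT_DIR)
         .start()
)
```

The trade-off versus an external trigger is that file discovery then depends on a continuously running (or scheduled) stream instead of one event per blob.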

2 Answers

2 votes

As Partha Deb says, using Azure Data Factory will be easier for your requirement.

You just need to create a pipeline, add an event trigger based on the 'blob created' event, and have it run the Databricks notebook activity, passing the file name as a parameter.

This is built-in functionality of Data Factory; you can check the documentation:

https://docs.microsoft.com/en-us/azure/data-factory/concepts-pipelines-activities

https://docs.microsoft.com/en-us/azure/data-factory/transform-data-databricks-notebook

https://docs.microsoft.com/en-us/azure/data-factory/how-to-expression-language-functions

You can look at the documents above. In the end, you basically only need a few clicks in the Data Factory UI, plus the small notebook-side change sketched below.
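On the notebook side, the file name arrives as a base parameter of the Databricks Notebook activity and is read through a widget. A minimal sketch; the widget name "file_name" and the mount path are placeholders that must match whatever you configure in the activity:

```python
# Databricks notebook cell.
# "file_name" must match the base parameter name configured on the
# ADF Databricks Notebook activity (placeholder name).
dbutils.widgets.text("file_name", "")           # default for interactive runs
file_name = dbutils.widgets.get("file_name")

input_path = f"/mnt/blobmount/{file_name}"      # placeholder mount point

# Same parsing as before, now driven by the parameter instead of a
# hardcoded DBFS path.
df = spark.read.text(input_path)
df.show(5, truncate=False)
```

In the activity's base parameters, the value can reference the trigger output; for a blob event trigger the file name is typically referenced with @triggerBody().fileName (see the expression-language document linked above).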

1 vote

I ended up using ADF. I created a new pipeline with blob event triggers that fire based on the file names and pass them to the notebook.