0
votes

Disclaimer: Not a code query, but directly related to it.

I find it difficult in Databricks to handle such scenarios where there's no shell prompt; just the notebooks. I have two clusters on Azure dev & prod. The database & tables can be accessed via Databricks Notebooks of separate environments.

The problem arises when I want to:

  1. Query data in dev, but from prod environment & vice-versa. On a sql prompt, it just seems impossible to achieve this.
  2. If I want to populate dev table from prod table; there's no way to establish a connection from within the dev notebook to query the table of prod environment.

The workaround I've established for now to copy the prod data into dev is:

  • Download full dump from production in csv in my local machine.
  • Upload to DBFS in dev environment.
  • Create temp table/directly insert the csv in the dev table.

Any comments on how I remove this download-upload process & query prod directly from dev notebook?

1

1 Answers

0
votes

DBFS root is not really a production-grade solution, it's recommended that you always mount an external storage (e.g. Azure Storage - blob or ADLS Gen2)and use it to store your tables.

If you use external storage the problem becomes quite simple - all you have to do is mount the production storage on the dev cluster and you can access it as tables can be defined both over root dbfs and mounted data sources. So you can have a notebook that copies data from one to the other (and hopefully does all of the data anonymization / sampling that you need). You can also setup a more explicit process for that using Azure Data Factory, in most cases using only simple copy activity.