
I'm running into some frustration between teams at our small company around the use of Azure Data Lake Storage Gen 2 as the backend for our Delta tables in Azure Databricks. (I'm new at this company and to Databricks, so much of what I describe was decided before my time; I realize some of it may be questionable, but I'm not looking for perspectives on that.)

Essentially, the engineering team is building data ingestion pipelines (as Python files, not notebooks) that run on and are scheduled by Azure Databricks (Jobs API). This means the pipelines must be able to access the ADLS Gen 2 storage resource, so we authenticate directly using a Service Principal (SPN) and OAuth 2.0, as described in this Microsoft doc, setting the following configs via spark.conf.set():

fs.azure.account.auth.type.[STORAGE_ACCT].dfs.core.windows.net
fs.azure.account.oauth.provider.type.[STORAGE_ACCT].dfs.core.windows.net
fs.azure.account.oauth2.client.id.[STORAGE_ACCT].dfs.core.windows.net
fs.azure.account.oauth2.client.secret.[STORAGE_ACCT].dfs.core.windows.net
fs.azure.account.oauth2.client.endpoint.[STORAGE_ACCT].dfs.core.windows.net

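For reference, here is a minimal sketch of what that looks like in the pipeline code, with the client secret pulled from a Databricks secret scope (spark and dbutils are the objects Databricks provides; the storage account name, secret scope, and key names below are placeholders, not our real values):

```python
# Minimal sketch of session-scoped SPN auth against ADLS Gen 2.
# All names below (storage account, secret scope, key names) are placeholders.
storage_account = "companydatalakedev"  # placeholder

tenant_id = dbutils.secrets.get(scope="adls-dev", key="tenant-id")
client_id = dbutils.secrets.get(scope="adls-dev", key="spn-client-id")
client_secret = dbutils.secrets.get(scope="adls-dev", key="spn-client-secret")

suffix = f"{storage_account}.dfs.core.windows.net"
spark.conf.set(f"fs.azure.account.auth.type.{suffix}", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{suffix}",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(f"fs.azure.account.oauth2.client.id.{suffix}", client_id)
spark.conf.set(f"fs.azure.account.oauth2.client.secret.{suffix}", client_secret)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{suffix}",
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
)
```
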
Given that engineering owns this for our pipelines, and we use different storage accounts for dev and prod, the codebase detects which environment it's running in (dev or prod) and sets the configs for the appropriate storage account. Perhaps unconventional, but it causes no problems on our side.
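
A rough, hypothetical sketch of that environment-detection pattern, assuming the environment is signaled through a cluster/job environment variable (the variable name and account names are illustrative only):

```python
import os

# Hypothetical: the deployment pipeline sets DEPLOY_ENV on the job cluster.
ENV = os.environ.get("DEPLOY_ENV", "dev")

STORAGE_ACCOUNTS = {
    "dev": "companydatalakedev",    # placeholder dev storage account
    "prod": "companydatalakeprod",  # placeholder prod storage account
}

storage_account = STORAGE_ACCOUNTS[ENV]
secret_scope = f"adls-{ENV}"  # assumes per-environment secret scopes
# ...then set the fs.azure.* configs for storage_account as in the snippet above.
```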

The issue: Our data science team works in Databricks notebooks and also needs access to the tables backed by ADLS Gen 2, so they must authenticate as well. However, their code is not engineering code and does not account for the change in environment. They're frustrated that, on promotion to production, they must make a small tweak so the notebook works in prod, after which it no longer works in dev, since each environment sits in its own VNet with its own storage account.

The Ask: How do we securely allow table access without sacrificing access controls, and without having to set these configs in the notebooks/codebase? Is this even possible with Databricks?

If other teams are using the above method, how do they swap storage accounts/containers on promotion to production without breaking the code in dev? Is it just a matter of telling the DS team they have to use our logic?

What I've tried:

  • ADLS Credential Passthrough - we schedule their models via the Databricks Jobs API, which has a hard limitation around credential passthrough, so we cannot use it
  • Mounting ADLS - since a mount gives anyone with access to the cluster access to the ADLS storage, the team is against this approach
  • Cluster config (UI) - we use ephemeral clusters for jobs, and since this would list any retrieved secrets in plaintext in the cluster config, everyone is against this approach

For now, I've implemented the engineering team's environment-detection logic in their notebook, but I cannot stress enough how livid they are that engineering code lives in their notebook.

I understand there may not be any straight shot answer, but any help would be appreciated.


1 Answer


You are right, there isn't a single answer - it's a tradeoff choice. The best writeup of the available options right now is here, and it explains six patterns in great detail:

  1. Access via Service Principal

  2. Multiple workspaces — permission by workspace

  3. AAD Credential passthrough

  4. Cluster scoped Service Principal

  5. Session scoped Service Principal

  6. Databricks Table Access Control

Right now you are using session-scoped service principals and paying the administrative overhead that comes with them.

I'm a little confused about your workflow between teams and your overall infrastructure, but here are a few options I'd think about first:

  • parametrize notebooks using widgets to choose the storage account, and keep secret naming consistent between environments (I assume you have dev and prod backed by different Key Vaults); also pull out as much boilerplate logic as possible into separate, reusable notebooks (see the sketch after this list)
  • use a passthrough mount in dev for the data science team and a service-principal mount with the same name in production (assuming production is a much more limited and controlled environment)
  • split off the data scientists into separate workspaces with workspace-wide mounts
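
To make the first bullet concrete, a rough sketch of the widget approach - the scope and key names are placeholders, and it assumes the dev and prod secret scopes use identical key names:

```python
# Notebook parameter with a sensible default for interactive dev work.
dbutils.widgets.dropdown("env", "dev", ["dev", "prod"], "Environment")
env = dbutils.widgets.get("env")

# Same key names in every environment; only the scope differs.
scope = f"adls-{env}"  # placeholder scope naming convention
storage_account = dbutils.secrets.get(scope=scope, key="storage-account-name")
tenant_id = dbutils.secrets.get(scope=scope, key="tenant-id")
client_id = dbutils.secrets.get(scope=scope, key="spn-client-id")
client_secret = dbutils.secrets.get(scope=scope, key="spn-client-secret")

# The scheduled job passes {"env": "prod"} as a notebook parameter
# (base_parameters in the Jobs API), so the notebook itself never changes
# between environments.
```

That way the data science notebooks contain no environment-specific values at all, and promotion to production is just a job parameter rather than a code change.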