
I'm trying to mount an Azure Data Lake Storage Gen2 account using a service principal and OAuth 2.0, as explained here:

configs = {
  "fs.azure.account.auth.type": "OAuth",
  "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id": "<application-id>",
  "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
  "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

dbutils.fs.mount(
  source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs
)
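
(Side note: before retrying, any stale mount point can be removed first; a trivial sketch with the same placeholder:)

# Remove a stale mount point before retrying (same placeholder as above).
if any(m.mountPoint == "/mnt/<mount-name>" for m in dbutils.fs.mounts()):
    dbutils.fs.unmount("/mnt/<mount-name>")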

The service principal that I use has the Storage Blob Data Contributor role at the storage account level and also has rwx access at the container level.

Anyway, I get this error:

ExecutionError: An error occurred while calling o242.mount.
: HEAD https://<storage-account-name>.dfs.core.windows.net/<file-system-name>?resource=filesystem&timeout=90
StatusCode=403
StatusDescription=This request is not authorized to perform this operation.

I even tried to access it directly using the storage account access key, as described here, but without success:

spark.conf.set(
  "fs.azure.account.key.<storage-account-name>.dfs.core.windows.net",
  dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-storage-account-access-key>")
)
dbutils.fs.ls("abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>")
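
(For reference, the service-principal equivalent of this direct, unmounted access would be the per-account OAuth settings of the ABFS driver, roughly like this with the same placeholders:)

# Session-scoped OAuth configuration for direct abfss:// access (no mount).
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net",
               "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net",
               "<application-id>")
spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net",
               dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"))
spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net",
               "https://login.microsoftonline.com/<directory-id>/oauth2/token")

dbutils.fs.ls("abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/<directory-name>")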

The thing is that with the Azure CLI I have no problem interacting with this storage account:

az login --service-principal --username <application-id> --password <client-secret> --tenant <directory-id>
az storage container list --account-name <storage-account-name> --auth-mode login
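
(A rough Python equivalent of that CLI check, as a sketch only: it assumes the azure-identity and azure-storage-file-datalake packages and the client secret passed in directly:)

# Sketch: list filesystems with the same service principal via the Python SDK.
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient

credential = ClientSecretCredential(
    tenant_id="<directory-id>",
    client_id="<application-id>",
    client_secret="<client-secret>",
)
service = DataLakeServiceClient(
    account_url="https://<storage-account-name>.dfs.core.windows.net",
    credential=credential,
)
for fs in service.list_file_systems():
    print(fs.name)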

Also, no problem using the REST API on my machine, but I get an AuthorizationFailure once on the cluster:

from getpass import getpass

import requests

from msal import ConfidentialClientApplication

client_id = "<application-id>"
client_password = getpass()

authority = "https://login.microsoftonline.com/<directory-id>"
scope = ["https://storage.azure.com/.default"]

app = ConfidentialClientApplication(
    client_id, authority=authority, client_credential=client_password
)

tokens = app.acquire_token_for_client(scopes=scope)
headers = {
    "Authorization": "Bearer " + tokens["access_token"],
    "x-ms-version": "2019-07-07" # THIS IS REQUIRED OTHERWISE I GET A 400 RESPONSE
}


endpoint = (
    "https://<storage-account-name>.dfs.core.windows.net/<file-system-name>//?action=getAccessControl"
)
response = requests.head(endpoint, headers=headers)

print(response.headers)
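
Since a HEAD response has no body, the failure detail has to be read from the status code and the x-ms-error-code header (when present), e.g.:

# On a 403, the x-ms-error-code header (e.g. AuthorizationFailure) says why.
print(response.status_code, response.headers.get("x-ms-error-code"))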

The firewall is set to only allow trusted Microsoft services to access the storage account.
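
(For context, that firewall exception can be read from the management plane; a sketch only, assuming azure-mgmt-storage, management-plane RBAC for the service principal, and subscription/resource-group names that are not shown above:)

# Sketch: read the storage account's network rules (firewall) settings.
from azure.identity import ClientSecretCredential
from azure.mgmt.storage import StorageManagementClient

credential = ClientSecretCredential(
    tenant_id="<directory-id>",
    client_id="<application-id>",
    client_secret="<client-secret>",
)
client = StorageManagementClient(credential, "<subscription-id>")

account = client.storage_accounts.get_properties("<resource-group>", "<storage-account-name>")
rules = account.network_rule_set
# bypass="AzureServices" corresponds to "allow trusted Microsoft services".
print(rules.default_action, rules.bypass)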

Did I enter a black hole, or is anybody experiencing the same issue with Databricks? Is it caused by the ABFS driver?

Please do check out the answer on the MSDN thread: social.msdn.microsoft.com/Forums/en-US/… – CHEEKATLAPRADEEP-MSFT
@CHEEKATLAPRADEEP-MSFT Am I missing something from the thread? Because I already did what's in it. – flappy
Databricks is not (yet) a trusted Microsoft service. You may try to add Databricks to the IAM? Or maybe create a vnet that would include Databricks and provide you with an IP that you could whitelist? Have a look here: databricks.com/blog/2020/02/28/… – Axel R.

1 Answer


Indeed, the problem was due to the firewall settings. Thank you Axel R!

I was misled by the fact that I also have an ADLS Gen 1 account with the same firewall settings and had no problem with it.

BUT, the devil is in the details. The Gen 1 firewall exceptions allow all Azure services to access the resource. The Gen 2 firewall, meanwhile, only allows trusted Azure services.

I hope this can help someone.