0 votes

I have an Azure Databricks Standard cluster (Databricks Runtime 6.4, which includes Apache Spark 2.4.5 and Scala 2.11) configured with Azure Active Directory credential passthrough to support querying an Azure Data Lake Storage Gen2 account.

The ADLS Gen2 filesystem was mounted via Python:

configs = {
  # Use the AAD credential passthrough token provider supplied by the cluster
  "fs.azure.account.auth.type": "CustomAccessToken",
  "fs.azure.account.custom.token.provider.class": spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}

# Mount the ADLS Gen2 container (names redacted here) under /mnt/taxi
dbutils.fs.mount(
  source = "abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/taxi",
  extra_configs = configs)

Using {SparkR} from a Databricks notebook returns results:

taxiall <- read.df("/mnt/taxi/yellow",source="parquet")
collect(mean(rollup(taxiall, "vendorID", "puLocationId"), "totalAmount"))

Using {sparklyr}, however, hits a problem with the token:

library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")
yellow_taxi <- spark_read_parquet(sc = sc, path = "/mnt/taxi/yellow")

yellow_taxi %>%
  group_by(vendorID, puLocationId) %>%
  summarise(avgFare = mean(totalAmount), n = n()) ->
  fares

collect(fares)

Error : com.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException: Could not find ADLS Gen2 Token

Is something extra needed to make sparklyr work with credential passthrough?

Based on an MSFT doc re: ADLS Gen1 suggesting it's a sparklyr bug/limitation, I've raised it on GitHub: github.com/sparklyr/sparklyr/issues/2342 – Steph Locke

2 Answers

0 votes

To use AD passthrough you have to use a cluster either in single-user mode, which supports Scala, or in multi-user mode, which supports Python and SQL only. Unfortunately the documentation doesn't say anything about R, so it may not be supported at all!

You could try the high concurrency cluster and see if it just works. This all comes down to the fact that the processes for each user need to be isolated for passthrough to be secure. Because R is an interop layer on top of the JVM, it could be supported in the same way Python is, though the documentation doesn't say so and it doesn't look enabled in the auto-generated configuration when you check this option in the cluster settings. It might be that they just haven't got round to it, because the overwhelming majority of users on this platform now use Python. A rough sketch of what that option sets is shown below.
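For reference, ticking the credential passthrough option on a high concurrency cluster roughly corresponds to the following Spark configuration (a sketch based on the linked doc; verify the exact keys against your own workspace). Note that r does not appear in the allowed-languages list:

spark.databricks.passthrough.enabled true
spark.databricks.repl.allowedLanguages python,sql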

It's covered in detail here, although it's not the best-written documentation:

https://docs.microsoft.com/en-us/azure/databricks/security/credential-passthrough/adls-passthrough

-1 votes

Note: Mounting an Azure Data Lake Storage Gen2 filesystem is supported only using OAuth credentials; mounting with an account access key is not supported.

Cause: The spark_read_* functions (e.g. spark_read_csv, spark_read_parquet) in sparklyr are not able to extract the ADLS passthrough token needed to authenticate and read data.

Solution: A workaround is to use an Azure AD application (client) ID, application key (client secret), and directory (tenant) ID to mount the ADLS location in DBFS with a service principal.

**Mount Azure Data Lake Storage Gen2 filesystem:**

To mount an Azure Data Lake Storage Gen2 filesystem or a folder inside it, use the following command:

Scala code:

val configs = Map(
  "fs.azure.account.auth.type" -> "OAuth",
  "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id" -> "<application-id>",
  "fs.azure.account.oauth2.client.secret" -> dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
  "fs.azure.account.oauth2.client.endpoint" -> "https://login.microsoftonline.com/<directory-id>/oauth2/token")

// Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/",
  mountPoint = "/mnt/<mount-name>",
  extraConfigs = configs)

Python Code:

configs = {"fs.azure.account.auth.type": "OAuth",
           "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
           "fs.azure.account.oauth2.client.id": "<application-id>",
           "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope = "<scope-name>", key = "<key-name-for-service-credential>"),
           "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<directory-id>/oauth2/token"}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
  source = "abfss://<file-system-name>@<storage-account-name>.dfs.core.windows.net/",
  mount_point = "/mnt/<mount-name>",
  extra_configs = configs)
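With the mount backed by a service principal rather than per-user passthrough, the sparklyr read from the question should then work against the mount point, since no per-user ADLS token has to be resolved. A minimal sketch, assuming the /mnt/taxi mount point and yellow folder from the question:

library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")

# Read from the service-principal-backed mount; no passthrough token is required
yellow_taxi <- spark_read_parquet(sc, name = "yellow_taxi", path = "/mnt/taxi/yellow")

yellow_taxi %>%
  group_by(vendorID, puLocationId) %>%
  summarise(avgFare = mean(totalAmount), n = n()) %>%
  collect()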

References: "Error when reading data from ADLS Gen1/Gen2 with sparklyr" and "Mount an Azure Data Lake Storage Gen2 account using a service principal and OAuth 2.0".