I have an Azure Databricks Standard cluster (Databricks Runtime 6.4, which includes Apache Spark 2.4.5 and Scala 2.11) configured with Azure Active Directory credential passthrough so that queries can run against an Azure Data Lake Storage Gen2 account.
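For context, passthrough was enabled through the cluster UI ("Enable credential passthrough for user-level data access" under Advanced Options), which, as I understand it, corresponds to this line in the cluster's Spark config:

spark.databricks.passthrough.enabled true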
The ADLS Gen2 filesystem was mounted via Python (container and storage account redacted below):
configs = {
  "fs.azure.account.auth.type": "CustomAccessToken",
  "fs.azure.account.custom.token.provider.class":
    spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}
dbutils.fs.mount(
  source = "abfss://<container>@<storage-account>.dfs.core.windows.net/",
  mount_point = "/mnt/taxi",
  extra_configs = configs)
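The mount itself seems fine; listing it from Python as a sanity check works (and SparkR below reads it without issue):

display(dbutils.fs.ls("/mnt/taxi"))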
Querying the mount with {SparkR} from a Databricks notebook returns results:
taxiall <- read.df("/mnt/taxi/yellow", source = "parquet")
collect(mean(rollup(taxiall, "vendorID", "puLocationId"), "totalAmount"))
Using {sparklyr}, however, fails with a token error:
library(sparklyr)
library(dplyr)

sc <- spark_connect(method = "databricks")
yellow_taxi <- spark_read_parquet(sc, path = "/mnt/taxi/yellow")

yellow_taxi %>%
  group_by(vendorID, puLocationId) %>%
  summarise(avgFare = mean(totalAmount), n = n()) ->
  fares

collect(fares)
Error : com.databricks.backend.daemon.data.client.adl.AzureCredentialNotFoundException: Could not find ADLS Gen2 Token
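If it helps narrow things down, the same token-provider conf that the mount code reads is queryable from Python in the notebook; what I don't know is whether the backend session that sparklyr connects to inherits it:

spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")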
Is anything extra needed to make sparklyr work with credential passthrough?