
I'm trying to connect from a local Spark job to my ADLS Gen 2 data lake to read some Databricks Delta tables that I previously stored through a Databricks notebook, but I'm getting a very strange exception that I can't sort out:

Exception in thread "main" java.io.IOException: There is no primary group for UGI <xxx> (auth:SIMPLE)
    at org.apache.hadoop.security.UserGroupInformation.getPrimaryGroupName(UserGroupInformation.java:1455)
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:136)
    at org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:108)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3303)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:361)
    at org.apache.spark.sql.delta.DeltaTableUtils$.findDeltaTableRoot(DeltaTable.scala:94)

Searching around, I haven't found many hints on this. One that I tried was to pass the config "spark.hadoop.hive.server2.enable.doAs", "false", but it didn't help.

I'm using io.delta 0.3.0, Spark 2.4.2 (Scala 2.12) and hadoop-azure 3.2.0. I can connect to my Gen 2 account without issues through an Azure Databricks cluster/notebook.

I'm using code like the following:

    try (final SparkSession spark = SparkSession.builder().appName("DeltaLake").master("local[*]").getOrCreate()) {
        //spark.conf().set("spark.hadoop.hive.server2.enable.doAs", "false");
        // Account key for the ADLS Gen2 storage account
        spark.conf().set("fs.azure.account.key.stratify.dfs.core.windows.net", "my gen 2 key");
        spark.read().format("delta").load("abfss://<container>@stratify.dfs.core.windows.net/Test");
    }
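
For completeness, the same settings could also be passed on the builder before the session is created, using the "spark.hadoop." prefix so that they end up in the Hadoop configuration. This is only a sketch of that variant; the container name is a placeholder:

    // Sketch: same settings, supplied before getOrCreate();
    // "spark.hadoop.<key>" forwards <key> into the Hadoop Configuration.
    try (final SparkSession spark = SparkSession.builder()
            .appName("DeltaLake")
            .master("local[*]")
            .config("spark.hadoop.hive.server2.enable.doAs", "false")
            .config("spark.hadoop.fs.azure.account.key.stratify.dfs.core.windows.net", "my gen 2 key")
            .getOrCreate()) {
        spark.read().format("delta").load("abfss://<container>@stratify.dfs.core.windows.net/Test");
    }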

1 Answer


Delta Lake on ADLS Gen2 requires Hadoop 3.2, Spark 3.0.0, and Delta Lake 0.7.0. The requirements are documented at https://docs.delta.io/latest/delta-storage.html#azure-data-lake-storage-gen2

The ADLS Gen2 Hadoop connector is only available starting with Hadoop 3.2.0, and Spark 3.0.0 is the first Spark version that supports Hadoop 3.2.

Databricks Runtime 6.x and older versions run Hadoop 2.7 and Spark 2.4, but the ADLS Gen2 Hadoop connector has been backported to this older Hadoop version internally. That's why Delta Lake can work on Databricks without upgrading to Spark 3.0.0.
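
As a rough sketch of what the upgraded setup could look like locally, assuming Spark 3.0.0 (Scala 2.12) with io.delta:delta-core_2.12:0.7.0 and org.apache.hadoop:hadoop-azure:3.2.0 on the classpath; the account, container, path and key below are placeholders, not values from your environment:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class DeltaOnAdlsGen2 {
        public static void main(String[] args) {
            try (final SparkSession spark = SparkSession.builder()
                    .appName("DeltaLake")
                    .master("local[*]")
                    // Delta Lake 0.7.0 session settings (see the Delta storage docs linked above)
                    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
                    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
                    // ABFS credentials; the "spark.hadoop." prefix forwards the property to the Hadoop configuration
                    .config("spark.hadoop.fs.azure.account.key.<account>.dfs.core.windows.net", "<storage account key>")
                    .getOrCreate()) {
                Dataset<Row> df = spark.read().format("delta")
                        .load("abfss://<container>@<account>.dfs.core.windows.net/<path>");
                df.show();
            }
        }
    }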