1 vote

I am using ADLS Gen2 and trying to process a file from a Databricks notebook using an 'abfss' path. I can read parquet files just fine, but when I try to load XML files I get a configuration-not-found error: Configuration property xxx.dfs.core.windows.net not found.

I haven't tried mounting the storage, but I am trying to understand whether this is a known limitation with XML files, since I can read the parquet files just fine.

Here is my XML library config: com.databricks:spark-xml_2.11:0.9.0

I tried a couple of things suggested in other articles, but I am still getting the same error.

  • Added a new scope in the Databricks workspace to rule out a secret-scope issue.
  • Tried adding the configuration spark.conf.set("fs.azure.account.key.xxxxx.dfs.core.windows.net", "xxxx==")
df = (spark.read.format("xml")
      .option("rootTag", "BookArticle")
      .option("inferSchema", "true")
      .option("error_bad_lines", True)
      .option("mode", "DROPMALFORMED")
      .load(abfsssourcename))  # abfsssourcename is the abfss:// path of the source file

Exception Details: Py4JJavaError: An error occurred while calling o1113.load.
Configuration property xxxx.dfs.core.windows.net not found.
    at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:392)
    at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1008)
    at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:151)
    at shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:106)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
    at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:500)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.setInputPaths(FileInputFormat.java:469)
    at org.apache.spark.SparkContext$$anonfun$newAPIHadoopFile$2.apply(SparkContext.scala:1281)
    at org.apache.spark.SparkContext$$anonfun$newAPIHadoopFile$2.apply(SparkContext.scala:1269)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.SparkContext.withScope(SparkContext.scala:820)
    at org.apache.spark.SparkContext.newAPIHadoopFile(SparkContext.scala:1269)
    at com.databricks.spark.xml.util.XmlFile$.withCharset(XmlFile.scala:46)
    at com.databricks.spark.xml.DefaultSource$$anonfun$createRelation$1.apply(DefaultSource.scala:71)
    at com.databricks.spark.xml.DefaultSource$$anonfun$createRelation$1.apply(DefaultSource.scala:71)
    at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:43)
    at com.databricks.spark.xml.XmlRelation$$anonfun$1.apply(XmlRelation.scala:42)
    at scala.Option.getOrElse(Option.scala:121)
    at com.databricks.spark.xml.XmlRelation.<init>(XmlRelation.scala:41)
    at com.databricks.spark.xml.XmlRelation$.apply(XmlRelation.scala:29)
    at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:74)
    at com.databricks.spark.xml.DefaultSource.createRelation(DefaultSource.scala:52)
    at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:350)
    at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:311)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:297)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:214)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
The package seems to be using the RDD API to read the XML file, so we need to set the key in the Hadoop configuration options. Please update the code to spark._jsc.hadoopConfiguration().set("fs.azure.account.key.xxxxx.dfs.core.windows.net", "xxxx=="). For more details, please refer to docs.databricks.com/data/data-sources/azure/… – Jim Xu
@JimXu, this line of code fixed the issue. I used mount as an alternative solution, but your response is the answer to my question :) – Satya Azure
Hi. I have summarized my suggestions as a solution. Since it is useful for you, could you please accept it as the answer? It may help more people who have a similar issue. – Jim Xu
Thanks for summarizing, accepted it as the answer! – Satya Azure

1 Answer

3 votes

I summarize the solution below.

The package com.databricks:spark-xml seems to use the RDD API to read the XML file. When we use the RDD API to access Azure Data Lake Storage Gen2, we cannot use Hadoop configuration options set with spark.conf.set(...); the account key has to be set directly in the Hadoop configuration. So we should update the code to spark._jsc.hadoopConfiguration().set("fs.azure.account.key.xxxxx.dfs.core.windows.net", "xxxx=="). For more details, please refer to here.
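A minimal sketch of the fixed notebook cell, assuming the masked storage account name, key, and the abfsssourcename variable from the question stand in for your real values:

# Set the account key in the Hadoop configuration so the RDD API used by
# spark-xml can reach ADLS Gen2 (spark.conf.set alone is not visible to it).
spark._jsc.hadoopConfiguration().set(
    "fs.azure.account.key.xxxxx.dfs.core.windows.net", "xxxx=="
)

# Read the XML file from the abfss:// path as before.
df = (spark.read.format("xml")
      .option("rootTag", "BookArticle")
      .option("inferSchema", "true")
      .option("mode", "DROPMALFORMED")
      .load(abfsssourcename))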

Besides, you can also mount Azure Data Lake Storage Gen2 as a file system in Azure Databricks, as sketched below.
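For reference, a hedged sketch of the mount approach using a service principal with OAuth 2.0; every value in angle brackets is a placeholder you would replace with your own, not something taken from the question:

# Assumed placeholders: <application-id>, <secret-scope>, <service-credential-key>,
# <directory-id>, <container-name>, <storage-account-name>, <mount-name>.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="<secret-scope>", key="<service-credential-key>"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<directory-id>/oauth2/token",
}

# Mount the container under /mnt so it can be read with a plain file path.
dbutils.fs.mount(
    source="abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs=configs,
)

# The XML file can then be loaded from the mount point, e.g. /mnt/<mount-name>/path/to/file.xml.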