How to access XML file from Azure Data Lake Gen2 and transform it into data-frame in Azure Databricks?

Question

We need to read an XML file stored in Azure Data Lake Gen2 and transform it into a DataFrame, as shown below.

Sample XML data:

<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/">
    <SOAP-ENV:Body>
        <ns2:getProjectsResponse
            xmlns:ns2="http://www.logic8.com/eq/webservices/generated">
            <ns2:Project>
                <ns2:fileName>P10001</ns2:fileName>
                <ns2:alias>project1</ns2:alias>
            </ns2:Project>
            <ns2:Project>
                <ns2:fileName>P10002</ns2:fileName>
                <ns2:alias>project2</ns2:alias>
            </ns2:Project>
            <ns2:Project>
                <ns2:fileName>P10003</ns2:fileName>
                <ns2:alias>project3</ns2:alias>
            </ns2:Project>
        </ns2:getProjectsResponse>
    </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

Expected Dataframe output:

Can anyone help me with this?

Have a look at the spark-xml library: github.com/databricks/spark-xml. It can be easily installed on Databricks and allows you to manipulate XML schemas. — Axel R.

Leon Yue · Accepted Answer · 2020-02-17T04:33:01

Firstly, you need to learn how to read data from Azure Data Lake Gen2 into Azure Databricks.

There are many tutorials you can learn from:

Databricks: Importing data from a Blob storage. This blogpost is about importing data from a Blob storage to Azure databricks.
Databricks Azure Blob Storage: This article explains how to access Azure Blob storage by mounting storage using DBFS or directly using APIs.
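
The tutorials above cover Blob storage. For Data Lake Gen2, one common route is direct access through the ABFS driver (abfss:// paths) instead of a mount. Below is a minimal sketch, assuming account-key authentication. The account name "mylake", the container "raw", the file path, and the secret scope and key names are all hypothetical placeholders, and the helper functions are only illustrative, not part of any library. `spark` and `dbutils` are the objects a Databricks notebook provides.

```python
def account_key_conf(account):
    """Spark config key that holds the access key for an ADLS Gen2 account."""
    return f"fs.azure.account.key.{account}.dfs.core.windows.net"


def abfss_uri(container, account, path):
    """Build an abfss:// URI for a file in an ADLS Gen2 container."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def configure_account_key(spark, account, access_key):
    """Register the account key on an existing SparkSession."""
    spark.conf.set(account_key_conf(account), access_key)


# Hypothetical names; replace with your own account, container and path.
xml_path = abfss_uri("raw", "mylake", "/xmls/projects.xml")

# In a Databricks notebook (where `spark` and `dbutils` already exist),
# with a hypothetical secret scope "my-scope" and secret "my-key":
# configure_account_key(spark, "mylake", dbutils.secrets.get("my-scope", "my-key"))
```

Once the key is registered, `xml_path` can be passed to `spark.read...load(...)` in the same way as a dbfs:/mnt/ path.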

Secondly, for the XML data, you need to use the Databricks spark-xml library, which @Axel R mentioned in the comment.

Import the spark-xml library into your workspace https://docs.databricks.com/user-guide/libraries.html#create-a-library (search spark-xml in the maven/spark package section and import it)
Attach the library to your cluster https://docs.databricks.com/user-guide/libraries.html#attach-a-library-to-a-cluster
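Use the following code in your notebook to read the XML file. When reading, spark-xml uses the "rowTag" option to pick the element that becomes one DataFrame row ("rootTag" only applies when writing); here "note" is that element in the example file.

xmldata = spark.read.format('xml').option("rowTag", "note").load('dbfs:/mnt/mydatafolder/xmls/note.xml')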
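
For the sample XML in the question, the repeating element is ns2:Project, so a read could set spark-xml's "rowTag" option to that element. This is an untested sketch, assuming the spark-xml library is attached to the cluster and that it keeps the namespace prefix in the column names (as far as I know, it does by default). The function names `strip_ns_prefix` and `read_projects` are hypothetical helpers, not part of spark-xml.

```python
def strip_ns_prefix(column_name):
    """Drop an XML namespace prefix, e.g. 'ns2:fileName' -> 'fileName'."""
    return column_name.split(":", 1)[-1]


def read_projects(spark, path):
    """Read one row per <ns2:Project> element and rename the columns.

    Assumes the spark-xml library is installed on the cluster.
    """
    df = (
        spark.read.format("xml")
        .option("rowTag", "ns2:Project")  # element that becomes one row
        .load(path)
    )
    # Rename 'ns2:fileName' / 'ns2:alias' to plain 'fileName' / 'alias'
    return df.toDF(*[strip_ns_prefix(c) for c in df.columns])
```

In a notebook this would be called as read_projects(spark, 'dbfs:/mnt/mydatafolder/xmls/projects.xml'), where the path is a hypothetical one modeled on the example above. If it works as assumed, the result should have one row per Project with fileName and alias columns.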

Please refer to: How can I read a XML file Azure Databricks Spark.

Combining these documents, I think you can figure out your problem. I don't know much about Azure Databricks, and I'm sorry that I can't test this for you.

Hope this helps.
