
As part of a problem statement, I am working on parsing XML data using PySpark.

Below is the sample data -

<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<component>Engineering

  <headers>
    <header>
      <name>Date</name>
      <value>05/05/2021 14:50:03</value>
    </header>
    <header>
      <name>StdName</name>
      <value>CoreEngineering</value>
    </header>
    <header>
      <name>ID</name>
      <value>12432AA</value>
    </header>
    <header>
      <name>DeviceType</name>
      <value>EngineGear</value>
    </header>
  </headers>
  
   <headers>
    <header>
      <name>Date</name>
      <value>05/05/2021 14:59:13</value>
    </header>
    <header>
      <name>StdName</name>
      <value>CoreEngineering</value>
    </header>
    <header>
      <name>ID</name>
      <value>98344AA</value>
    </header>
    <header>
      <name>DeviceType</name>
      <value>EngineExhaust</value>
    </header>
  </headers>
  
  
 </component>

While parsing this XML file in Databricks using PySpark, I am using the logic below -

import datetime
import time
from pyspark.sql.functions import lit, unix_timestamp

timestamp = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')

df = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rootTag", "headers") \
    .option("rowTag", "header") \
    .load("/mnt/test/sourcedata/sample.xml") \
    .withColumn("processeddatetime",unix_timestamp(lit(timestamp),'yyyy-MM-dd HH:mm:ss').cast("timestamp"))

And the output received is:

[screenshot: output DataFrame with columns name, value and processeddatetime]

However, I am wondering how I can transpose the received data, since that is required for further data transformations. The required dataset should have a schema like the one below -

[screenshot: expected dataset with columns Component, Date, StdName, ID, DeviceType and processeddatetime]

How can I transpose this data into the expected schema above in PySpark?

Comment: please replace screenshots with text data so that users can replicate your scenario – abiratsis

1 Answer


Grouping the data after the XML file has been read will be difficult, as the data contains no grouping criterion that indicates which rows belong together.

Instead, using component as rootTag and headers as rowTag, you get the required grouping of the rows. Within a row, you can use map_from_entries to turn the array of header entries into a map and then select the individual elements of the map as required:

import datetime
import time
import pyspark.sql.functions as F

timestamp = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')

# Each <headers> element becomes one row; map_from_entries turns the
# array of <header> name/value structs into a map for easy lookup.
df = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rootTag", "component") \
    .option("rowTag", "headers") \
    .load("test.xml") \
    .withColumn("tmp", F.map_from_entries("header")) \
    .selectExpr("'Engineering' as Component",
                "tmp['Date'] as Date",
                "tmp['StdName'] as StdName",
                "tmp['ID'] as ID",
                "tmp['DeviceType'] as DeviceType") \
    .withColumn("processeddatetime", F.unix_timestamp(F.lit(timestamp), 'yyyy-MM-dd HH:mm:ss').cast("timestamp"))

The value of the column Component is hard-coded here, as you cannot get a value from the root tag element with Spark XML.
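
If hard-coding is not acceptable, one possible workaround (a sketch, not part of the original answer) is to read the root element's text separately on the driver with the standard library and substitute it for the literal. This assumes the file is small enough to parse locally and is reachable through the /dbfs local path that Databricks exposes for the mount in the question:

# Sketch: recover the root tag's text instead of hard-coding it; the
# /dbfs/... path is an assumption based on the mount used in the question.
import xml.etree.ElementTree as ET

root = ET.parse("/dbfs/mnt/test/sourcedata/sample.xml").getroot()
component_name = (root.text or "").strip()  # "Engineering" for the sample file

# then replace the hard-coded literal:
# df = df.withColumn("Component", F.lit(component_name))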