As part of a problem I am working on, I need to parse XML data using PySpark.
Below is the sample data:
<?xml version="1.0" encoding="UTF-8" standalone="no" ?>
<component>Engineering
<headers>
<header>
<name>Date</name>
<value>05/05/2021 14:50:03</value>
</header>
<header>
<name>StdName</name>
<value>CoreEngineering</value>
</header>
<header>
<name>ID</name>
<value>12432AA</value>
</header>
<header>
<name>DeviceType</name>
<value>EngineGear</value>
</header>
</headers>
<headers>
<header>
<name>Date</name>
<value>05/05/2021 14:59:13</value>
</header>
<header>
<name>StdName</name>
<value>CoreEngineering</value>
</header>
<header>
<name>ID</name>
<value>98344AA</value>
</header>
<header>
<name>DeviceType</name>
<value>EngineExhaust</value>
</header>
</headers>
</component>
I am parsing this XML file in Databricks using PySpark with the logic below:
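import datetime
import time
from pyspark.sql.functions import lit, unix_timestamp

# Processing time as a string, in the same pattern that unix_timestamp parses below
timestamp = datetime.datetime.fromtimestamp(time.time()).strftime('%Y-%m-%d %H:%M:%S')
# rowTag="header" makes each <header> element one row of the DataFrame
df = spark.read \
    .format("com.databricks.spark.xml") \
    .option("rootTag", "headers") \
    .option("rowTag", "header") \
    .load("/mnt/test/sourcedata/sample.xml") \
    .withColumn("processeddatetime", unix_timestamp(lit(timestamp), 'yyyy-MM-dd HH:mm:ss').cast("timestamp"))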
The output received is:
However, I am wondering how to transpose the data received, since that is required for further data transformations. The required dataset should have a schema like the one below:
How can I transpose this data to the expected schema above in PySpark?
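To make the transposition I have in mind concrete, here is a minimal plain-Python sketch (standard library only, not PySpark). It assumes that each <headers> block should become one row, with one column per <name> and the matching <value> as the cell. The function name transpose_headers and the shortened sample are hypothetical and only for illustration; this is not meant as the PySpark solution itself.

```python
import xml.etree.ElementTree as ET

# Shortened version of the sample data above (two name/value pairs per block).
SAMPLE_XML = """<component>Engineering
<headers>
<header><name>Date</name><value>05/05/2021 14:50:03</value></header>
<header><name>ID</name><value>12432AA</value></header>
</headers>
<headers>
<header><name>Date</name><value>05/05/2021 14:59:13</value></header>
<header><name>ID</name><value>98344AA</value></header>
</headers>
</component>"""


def transpose_headers(xml_text):
    """Return one dict per <headers> block, keyed by each <name>.

    Assumption: names are unique within a single <headers> block.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for headers in root.iter("headers"):
        # Turn the name/value pairs of this block into column -> cell.
        row = {
            header.findtext("name"): header.findtext("value")
            for header in headers.findall("header")
        }
        rows.append(row)
    return rows


rows = transpose_headers(SAMPLE_XML)
for row in rows:
    print(row)
```

In other words, the grouping unit is the <headers> block rather than the individual <header> element, which is what I would like to reproduce on the Spark DataFrame.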