I have the following XML document:
<a date="26-03-2018" id="1">
<text>
</text>
<metadata>
<b>
<c c="STRING1">
<d="value" e="string"/>
</c>
<c c="STRING2">
<d="value2" e="string" />
</c>
</b>
</metadata>
</a>
Using the Databricks XML parser, I want to extract the values of the "c" attribute (STRING1, STRING2) as a list into the metadata column of a DataFrame. However, when I read the file with the following custom schema:
from pyspark.sql.types import StructType, StructField, StringType, LongType
schema = StructType([
    StructField("date", StringType(), True),
    StructField("id", LongType(), True),
    StructField("text", StringType(), True),
    StructField("metadata", StructType([
        StructField("b", StringType(), True),
    ]), True),
])
I get this DataFrame:
----------------------------------------------------------------------------------------------------------------------
Id | date | text | metadata
----------------------------------------------------------------------------------------------------------------------
1 | 26-03-2018 | text |' <c c="STRING1"> <d="value" e="string"/></c><c c="STRING2"><d="value2" e="string" /> </c>'
The entire content of the 'b' node comes back as a single string. How can I extract only the "c" attribute values into the metadata column using the Databricks XML parser, or is there another parser that can do this? I couldn't find a working solution, and I am new to Spark. Thanks in advance.
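To make the desired output concrete, here is a minimal sketch of the result I am after. It uses Python's standard-library xml.etree.ElementTree rather than the Databricks parser, so it only illustrates the target shape of the metadata column and is not a Spark solution. The helper name extract_row is hypothetical, and the <d> elements are written as <d d="..." e="..."/> purely as an assumption, so that the sample is well-formed XML.

```python
import xml.etree.ElementTree as ET

# Sample document from the question. The <d> elements are written as
# <d d="..." e="..."/> (an assumption) so that the sample parses as XML.
SAMPLE = """<a date="26-03-2018" id="1">
<text>
</text>
<metadata>
<b>
<c c="STRING1">
<d d="value" e="string"/>
</c>
<c c="STRING2">
<d d="value2" e="string"/>
</c>
</b>
</metadata>
</a>"""


def extract_row(xml_text):
    """Hypothetical helper: build one row from a single <a> document."""
    root = ET.fromstring(xml_text)
    return {
        "id": int(root.get("id")),
        "date": root.get("date"),
        # <text> holds only whitespace in the sample, so this becomes "".
        "text": (root.findtext("text") or "").strip(),
        # Collect the "c" attribute of every <c> element under metadata/b.
        "metadata": [c.get("c") for c in root.findall("./metadata/b/c")],
    }


row = extract_row(SAMPLE)
print(row["metadata"])  # ['STRING1', 'STRING2']
```

A list like this is what I would like the metadata column to hold for each row.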