Pyspark dataframe with XML column and multiple values inside: Extract columns out of it

Question

I have a pyspark dataframe where one column has a XML inside. Each XML in a row looks like that, some have 2 entries, some 3 and 4:

Example of one row entry:

<?xml version="1.0" encoding="utf-8"?> <goals> <goal id="445" name="xxxy" sex="F" /> <goal id="2468" name="qwerzui" sex="F" /> <goal id="4334" name="foo" sex="M" /> <goal id="15" name="fooh" sex="F" /> </goals>

I need to parse the values out of it for goal id, name and sex and create columns out of it.

Since an XML can have several entries, it is difficult to generate a fixed number of columns from it. My idea was to create a column for each of these attributes (means add 3 columns to the dataframe), where then lists are inside.

in this example expand the pyspark dataframe with columns:

goal id	name	sex
[445,2468,4334,15]	[xxxy,qwerzui,foo,fooh]	[F,F,M,F]

I was thinking of a UDF that iterates through the XML column and creates the columns from it. What do you think about this, does it make sense to do it this way? To also be able to analyze later. It is actually not common to have lists in columns.

I tried it via:

import xml.etree.ElementTree as ET

root = ET.fromstring(string)

and with the following i can access the values inside, but i am not able to put it in a proper udf function to expand my pyspark dataframe.

for child in root:
  print(child.tag, child.attrib)
  
for child in root:
  print(child.attrib['age'],child.attrib['sex'])

Unfortunately the other solutions from stackoverflow could not help me, so I hope for a solution for my problem

balderman balderman · Accepted Answer · 2020-12-19T13:49:18

The code below generates the 3 lists

import xml.etree.ElementTree as ET

XML = '''<?xml version="1.0" encoding="utf-8"?> <goals> <goal id="445" name="xxxy" sex="F" /> 
                                                        <goal id="2468" name="qwerzui" sex="F" /> <goal id="4334" name="foo" sex="M" /> 
                                                        <goal id="15" name="fooh" sex="F" /> 
                                                </goals>
'''
final = []
attributes = ['id', 'name', 'sex']
root = ET.fromstring(XML)
for attrib in attributes:
    final.append([g.attrib[attrib] for g in root.findall('goal')])
print(final)

output

[['445', '2468', '4334', '15'], ['xxxy', 'qwerzui', 'foo', 'fooh'], ['F', 'F', 'M', 'F']]

Pyspark dataframe with XML column and multiple values inside: Extract columns out of it

2 Answers