I suspect this might be very easy and I simply have not figured it out yet.
The objective is to 'flatten' an XML file into a pandas DataFrame.
Here is one such XML (a direct download of a ~60 MB zip file, which inflates to around 800 MB when extracted).
I have tried the following two approaches.
The first one, taken from here, has been modified slightly:
```python
import xml.etree.ElementTree as ET
import pandas as pd

def xml2dfa(xml_data):
    tree = ET.parse(xml_data)
    root = tree.getroot()[1]  # Modification here
    all_records = []
    headers = []
    for i, child in enumerate(root):
        record = []
        for subchild in child:
            record.append(subchild.text)
            if subchild.tag not in headers:
                headers.append(subchild.tag)
        all_records.append(record)
    return pd.DataFrame(all_records, columns=headers)
```
The assignment to `root` (marked `# Modification here`) was changed so that it selects the `LEIRecords` element rather than `LEIHeader`.
This yields a DataFrame with the correct number of rows but only 4 columns:

```
array(['{http://www.leiroc.org/data/schema/leidata/2014}LEI',
       '{http://www.leiroc.org/data/schema/leidata/2014}Entity',
       '{http://www.leiroc.org/data/schema/leidata/2014}Registration',
       '{http://www.leiroc.org/data/schema/leidata/2014}Extension'], dtype=object)
```
Columns 2 to 4 still contain nested children whose information could be extracted, but as written all of it is lost: the only unique value in each of those columns is whitespace:

```
array(['\n        '], dtype=object)
```
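The information is lost because `subchild.text` only reads the text directly inside an element, not its descendants. One way to recover the nested branches (a sketch, not the original code; the sample tag names in the usage line are made up for illustration) is to walk each record recursively and collect leaf-node text keyed by the element's path:

```python
import xml.etree.ElementTree as ET

def flatten_record(elem, prefix="", out=None):
    """Recursively collect leaf-node text into a flat dict keyed by path."""
    if out is None:
        out = {}
    for child in elem:
        # Drop the '{namespace}' prefix so the keys stay readable
        tag = child.tag.split('}')[-1]
        key = f"{prefix}.{tag}" if prefix else tag
        if len(child):
            # Nested element: recurse, extending the path
            flatten_record(child, key, out)
        else:
            # Leaf element: keep its text
            out[key] = child.text
    return out

# Illustrative record (not the real LEI schema):
flat = flatten_record(ET.fromstring(
    "<Record><LEI>123</LEI><Entity><Name>Acme</Name></Entity></Record>"))
# flat == {'LEI': '123', 'Entity.Name': 'Acme'}
```

Each record then becomes one flat dict, and a list of such dicts can be handed straight to `pd.DataFrame`.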
The second approach, taken from here, has been running for at least 16 hours with no result, so something is not right.
The expected output is a DataFrame that is completely flat, with NaN wherever information is missing (because a particular branch does not go that deep, or was not populated), as in this question.
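For the NaN-filling part, once every record is reduced to a flat dict (with whatever keys it happens to have), pandas fills the missing keys automatically; a minimal illustration with made-up values:

```python
import pandas as pd

# Flattened records with differing keys, as happens when a branch is absent
records = [
    {'LEI': '001', 'Entity.Name': 'Acme Corp'},
    {'LEI': '002'},  # no Entity.Name branch in this record
]
df = pd.DataFrame.from_records(records)
# Missing keys become NaN: df.loc[1, 'Entity.Name'] is NaN
```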