Convert XML into Lists of Tags and Values with Python

Question

I'm learning Python and I'm trying to extract lists of all tags and corresponding values from any XML file. This is my code so far.

def ParseXml(XmlFile):
    try:
        parser = etree.XMLParser(remove_blank_text=True, compact=True)
        tree = ET.parse(XmlFile, parser)
        root = tree.getroot()

        ListOfTags, ListOfValues, ListOfAttribs = [], [], []
        for elem in root.iter('*'):
            Tag = elem.tag
            ListOfTags.append(Tag)

            value = elem.text
            if value is not None:
                ListOfValues.append(value)
            else:
                ListOfValues.append('')

            attrib = elem.attrib
            if attrib:
                ListOfAttribs.append([attrib])
            else:
                ListOfAttribs.append([])
        print('%s File parsed successfully' % XmlFile)
        return (ListOfTags, ListOfValues, ListOfAttribs)

    except Exception as e:
        print('Error while parsing XMLs : %s : %s' % (type(e), e))
        return ([], [], [])

For an XML input like this:

<?xml version="1.0" encoding="UTF-8"?>
<Application Version="2.01">
    <UserAuthRequest>
        <VendorApp>
            <AppName>SING</AppName>
        </VendorApp>
    </UserAuthRequest>
    <ApplicationRequest ID="12-123-AH">
        <GUID>ABD45129-PD1212-121DFL</GUID>
        <Type tc="200">Streaming</Type>
        <File></File>
        <FileExtension VendorCode="200">
            <Result>
                <ResultCode tc="1">Success</ResultCode>
            </Result>
        </FileExtension>
    </ApplicationRequest>
</Application>

This output is multiple lists of tags, values and attributes. This is working fine.

['Application', 'UserAuthRequest', 'VendorApp', 'AppName', 'ApplicationRequest', 'GUID', 'Type', 'File', 'FileExtension', 'Result', 'ResultCode']
['', '', '', 'SING', '', 'ABD45129-PD1212-121DFL', 'Streaming', '', '', '', 'Success']
[[{'Version': '2.01'}], [], [], [], [{'ID': '12-123-AH'}], [], [{'tc': '200'}], [], [{'VendorCode': '200'}], [], [{'tc': '1'}]]

But my problem is that i need the tags including the parent and child tags. Like below is actual output I'm targetting:

['Application', 'UserAuthRequest', 'UserAuthRequest.VendorApp', 'UserAuthRequest.VendorApp.AppName', 'ApplicationRequest', 'ApplicationRequest.GUID', 'ApplicationRequest.Type', 'ApplicationRequest.File', 'ApplicationRequest.File.FileExtension', 'ApplicationRequest.File.FileExtension.Result', 'ApplicationRequest.File.FileExtension.Result.ResultCode']

How do i make this happen with Python? or is there any other alternate way to do this?

I read somewhere it is similar to lxml. Is it possible to get the required output using BeautifulSoup? If so, how? — Naveen
The target output seems to be inconsistent, at least for the root node children; they should be Application.UserAuthRequest and Application.ApplicationRequest . Also, there's no ApplicationRequest.File.* in the xml. — CristiFati
You are probably right. I had to manually writedown the output list. Is this output possible? — Naveen

CristiFati CristiFati · Accepted Answer · 2017-06-19T05:34:23

Here's a recursive approach that only uses [Python]: xml.etree.ElementTree — The ElementTree XML API:

import xml.etree.ElementTree as ET


def parse_node(node, ancestor_string=""):
    #print(type(node), dir(node))

    if ancestor_string:
        node_string = ".".join([ancestor_string, node.tag])
    else:
        node_string = node.tag
    tag_list = [node_string]
    text = node.text
    if text:
        text_list = [text.strip()]
    else:
        text_list = [""]
    attr_list = [node.attrib]
    for child_node in list(node):
        child_tag_list, child_text_list, child_attr_list = parse_node(child_node, ancestor_string=node_string)
        tag_list.extend(child_tag_list)
        text_list.extend(child_text_list)
        attr_list.extend(child_attr_list)
    return tag_list, text_list, attr_list


def parse_xml(file_name):
    tree = ET.parse("test.xml")
    root_node = tree.getroot()
    tags, texts, attrs = parse_node(root_node)
    print(tags)
    print(texts)
    print(attrs)


def main():
    parse_xml("a.xml")


if __name__ == "__main__":
    main()

Notes:

The idea is to "remember the path" in the xml tree. That is done via parse_node's ancestor_string argument, which is computed for each node in the tree and passed to its (direct) children
The naming differs from the one in the question because of [Python]: PEP 8 -- Style Guide for Python Code considerations
At 1^st glance, it might seem that having 2 functions (main and parse_xml) where one just calls the other, only adds an useless level of nesting, but it's a good practice that I got used to
I corrected the attributes list. Instead a list of lists with each inner list containing a single dictionary, return a list of dictionaries

Output (I've run the script with Python 2.7 and Python 3.5):

['Application', 'Application.UserAuthRequest', 'Application.UserAuthRequest.VendorApp', 'Application.UserAuthRequest.VendorApp.AppName', 'Application.ApplicationRequest', 'Application.ApplicationRequest.GUID', 'Application.ApplicationRequest.Type', 'Application.ApplicationRequest.File', 'Application.ApplicationRequest.FileExtension', 'Application.ApplicationRequest.FileExtension.Result', 'Application.ApplicationRequest.FileExtension.Result.ResultCode']
['', '', '', 'SING', '', 'ABD45129-PD1212-121DFL', 'Streaming', '', '', '', 'Success']
[{'Version': '2.01'}, {}, {}, {}, {'ID': '12-123-AH'}, {}, {'tc': '200'}, {}, {'VendorCode': '200'}, {}, {'tc': '1'}]

Convert XML into Lists of Tags and Values with Python

2 Answers