I'm learning Python and I'm trying to extract lists of all tags and corresponding values from any XML file. This is my code so far.

from lxml import etree


def ParseXml(XmlFile):
    try:
        parser = etree.XMLParser(remove_blank_text=True, compact=True)
        tree = etree.parse(XmlFile, parser)
        root = tree.getroot()

        ListOfTags, ListOfValues, ListOfAttribs = [], [], []
        for elem in root.iter('*'):
            Tag = elem.tag
            ListOfTags.append(Tag)

            value = elem.text
            if value is not None:
                ListOfValues.append(value)
            else:
                ListOfValues.append('')

            attrib = elem.attrib
            if attrib:
                ListOfAttribs.append([attrib])
            else:
                ListOfAttribs.append([])
        print('%s File parsed successfully' % XmlFile)
        return (ListOfTags, ListOfValues, ListOfAttribs)

    except Exception as e:
        print('Error while parsing XMLs : %s : %s' % (type(e), e))
        return ([], [], [])

For an XML input like this:

<?xml version="1.0" encoding="UTF-8"?>
<Application Version="2.01">
    <UserAuthRequest>
        <VendorApp>
            <AppName>SING</AppName>
        </VendorApp>
    </UserAuthRequest>
    <ApplicationRequest ID="12-123-AH">
        <GUID>ABD45129-PD1212-121DFL</GUID>
        <Type tc="200">Streaming</Type>
        <File></File>
        <FileExtension VendorCode="200">
            <Result>
                <ResultCode tc="1">Success</ResultCode>
            </Result>
        </FileExtension>
    </ApplicationRequest>
</Application>

The output is three lists: tags, values and attributes. This works fine.

['Application', 'UserAuthRequest', 'VendorApp', 'AppName', 'ApplicationRequest', 'GUID', 'Type', 'File', 'FileExtension', 'Result', 'ResultCode']
['', '', '', 'SING', '', 'ABD45129-PD1212-121DFL', 'Streaming', '', '', '', 'Success']
[[{'Version': '2.01'}], [], [], [], [{'ID': '12-123-AH'}], [], [{'tc': '200'}], [], [{'VendorCode': '200'}], [], [{'tc': '1'}]]

My problem is that I need each tag to include its parent tags. This is the output I'm targeting:

['Application', 'UserAuthRequest', 'UserAuthRequest.VendorApp', 'UserAuthRequest.VendorApp.AppName', 'ApplicationRequest', 'ApplicationRequest.GUID', 'ApplicationRequest.Type', 'ApplicationRequest.File', 'ApplicationRequest.File.FileExtension', 'ApplicationRequest.File.FileExtension.Result', 'ApplicationRequest.File.FileExtension.Result.ResultCode']

How do I make this happen with Python? Or is there another way to do this?

2
did you try using BeautifulSoup for this? - snapcrack
I read somewhere it is similar to lxml. Is it possible to get the required output using BeautifulSoup? If so, how? - Naveen
The target output seems to be inconsistent, at least for the root node children; they should be Application.UserAuthRequest and Application.ApplicationRequest . Also, there's no ApplicationRequest.File.* in the xml. - CristiFati
You are probably right. I had to write down the output list manually. Is this output possible? - Naveen

2 Answers

0
votes

Here's a recursive approach that uses only the standard library's xml.etree.ElementTree module ([Python]: xml.etree.ElementTree - The ElementTree XML API):

import xml.etree.ElementTree as ET


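def parse_node(node, ancestor_string=""):
    # Return three parallel lists (dotted tag paths, texts, attribute dicts)
    # for this node and all of its descendants, in document order.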

    if ancestor_string:
        node_string = ".".join([ancestor_string, node.tag])
    else:
        node_string = node.tag
    tag_list = [node_string]
    text = node.text
    if text:
        text_list = [text.strip()]
    else:
        text_list = [""]
    attr_list = [node.attrib]
    for child_node in list(node):
        child_tag_list, child_text_list, child_attr_list = parse_node(child_node, ancestor_string=node_string)
        tag_list.extend(child_tag_list)
        text_list.extend(child_text_list)
        attr_list.extend(child_attr_list)
    return tag_list, text_list, attr_list


def parse_xml(file_name):
    tree = ET.parse(file_name)
    root_node = tree.getroot()
    tags, texts, attrs = parse_node(root_node)
    print(tags)
    print(texts)
    print(attrs)


def main():
    parse_xml("a.xml")


if __name__ == "__main__":
    main()

Notes:

  • The idea is to "remember the path" in the XML tree. This is done through parse_node's ancestor_string argument, which is computed for each node and passed to its direct children
  • The naming differs from that in the question in order to follow [Python]: PEP 8 -- Style Guide for Python Code
  • At first glance, having two functions (main and parse_xml) where one just calls the other might seem to add a useless level of nesting, but it's a good practice that I've gotten used to
  • I corrected the attributes list: instead of a list of lists, each inner list containing a single dictionary, the function returns a list of dictionaries
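The same "remember the path" idea can also be written without recursion, which may help with very deep documents. The following is only a sketch: the name iter_paths and the inlined, shortened SAMPLE string are assumptions made for illustration and are not part of the answer's code.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's XML, inlined so the sketch runs on its own.
SAMPLE = """<Application Version="2.01">
    <UserAuthRequest>
        <VendorApp>
            <AppName>SING</AppName>
        </VendorApp>
    </UserAuthRequest>
    <ApplicationRequest ID="12-123-AH">
        <GUID>ABD45129-PD1212-121DFL</GUID>
    </ApplicationRequest>
</Application>"""


def iter_paths(root):
    """Yield (dotted_path, text, attributes) for every element, in document order."""
    # Each stack entry pairs a node with the dotted path that leads to it.
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        yield path, (node.text or "").strip(), dict(node.attrib)
        # Push children in reverse so that the first child is popped first.
        for child in reversed(list(node)):
            stack.append((child, path + "." + child.tag))


rows = list(iter_paths(ET.fromstring(SAMPLE)))
for path, text, attrs in rows:
    print(path, repr(text), attrs)
```

Yielding one tuple per element keeps tag, text and attributes together, so the three parallel lists can still be rebuilt with zip(*rows) if they are needed.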

Output (I've run the script with Python 2.7 and Python 3.5):

['Application', 'Application.UserAuthRequest', 'Application.UserAuthRequest.VendorApp', 'Application.UserAuthRequest.VendorApp.AppName', 'Application.ApplicationRequest', 'Application.ApplicationRequest.GUID', 'Application.ApplicationRequest.Type', 'Application.ApplicationRequest.File', 'Application.ApplicationRequest.FileExtension', 'Application.ApplicationRequest.FileExtension.Result', 'Application.ApplicationRequest.FileExtension.Result.ResultCode']
['', '', '', 'SING', '', 'ABD45129-PD1212-121DFL', 'Streaming', '', '', '', 'Success']
[{'Version': '2.01'}, {}, {}, {}, {'ID': '12-123-AH'}, {}, {'tc': '200'}, {}, {'VendorCode': '200'}, {}, {'tc': '1'}]
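Because the three lists are aligned by index, they can be combined into one record per element if that is easier to work with. This is a small sketch, assuming three entries copied by hand from the output above; the names records and find are made up for the example.

```python
# Three aligned entries copied from the output above (positions 5 to 7).
tags = [
    'Application.ApplicationRequest',
    'Application.ApplicationRequest.GUID',
    'Application.ApplicationRequest.Type',
]
texts = ['', 'ABD45129-PD1212-121DFL', 'Streaming']
attrs = [{'ID': '12-123-AH'}, {}, {'tc': '200'}]

# zip() pairs the items index by index, giving one record per element.
records = [
    {'path': path, 'text': text, 'attrs': attr}
    for path, text, attr in zip(tags, texts, attrs)
]


def find(path):
    """Return every record with the given dotted path (paths can repeat)."""
    return [record for record in records if record['path'] == path]


print(find('Application.ApplicationRequest.Type'))
```

A list of matches is returned rather than a single record because sibling elements with the same tag share the same dotted path.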
0
votes

I believe this is what you need:

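from bs4 import BeautifulSoup

# 'a.xml' is a placeholder path. The 'xml' parser needs lxml installed and,
# unlike the 'lxml' HTML parser, keeps the case of tag names.
xml_path = 'a.xml'
with open(xml_path, 'rb') as xml_file:
    soup = BeautifulSoup(xml_file, 'xml')

lst = []

for tag in soup.find_all(True):
    # tag.parents runs from the direct parent up to the BeautifulSoup object,
    # so drop that last entry and reverse the rest to get root-first order.
    ancestors = [parent.name for parent in tag.parents][:-1]
    lst.append('.'.join(ancestors[::-1] + [tag.name]))

print(lst)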