I work with BeautifulSoup using lxml to parse and navigate XML files.

I noticed strange behaviour: BeautifulSoup suppresses exceptions thrown by the lxml parser when reading a malformed XML file (e.g. a truncated document or missing closing tags).

Example:

from bs4 import BeautifulSoup
soup = BeautifulSoup("<foo><bar>trololo<", "xml") # this will work

It's even possible to call find() and navigate such a broken XML tree...
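
For example (a small sketch; the commented output reflects the tree lxml recovered):

from bs4 import BeautifulSoup

soup = BeautifulSoup("<foo><bar>trololo<", "xml")
print(soup.find("bar"))       # <bar>trololo</bar>
print(soup.find("bar").text)  # trololo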

Let's try reading exactly the same malformed document with pure lxml:

from lxml import etree
root = etree.fromstring("<foo><bar>trololo<") # will throw XMLSyntaxError

Why is this? I know BeautifulSoup itself is not doing any parsing; it's just a wrapper library around lxml (or other parsers). But I'm interested in actually getting errors if the XML is malformed, e.g. closing tags are missing. I want just basic XML syntax validation (I'm not interested in XSD schema validation).
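
In other words, something along these lines (just a sketch of the kind of check I mean; assert_well_formed is an illustrative name, not an existing API):

from lxml import etree

def assert_well_formed(data):
    # etree.fromstring uses a strict parser by default, so any
    # well-formedness problem raises etree.XMLSyntaxError.
    etree.fromstring(data)

assert_well_formed("<foo><bar>trololo<")  # raises XMLSyntaxError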

1 Answer

If you want to replicate that behaviour with lxml alone, pass a parser with recover=True:

from lxml import etree

root = etree.fromstring("<foo><bar>trololo<", parser=etree.XMLParser(recover=True))  # no exception raised

print(etree.tostring(root))

Output:

<foo><bar>trololo</bar></foo>
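
If you want recovery but still want to see what was wrong, the parser keeps an error_log you can inspect after parsing (a short sketch):

from lxml import etree

parser = etree.XMLParser(recover=True)
root = etree.fromstring("<foo><bar>trololo<", parser=parser)

# Even in recovery mode the parser records the problems it had to fix.
for entry in parser.error_log:
    print(entry.line, entry.column, entry.message)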

If you look at the bs4 source code, in the builder directory you will find _lxml.py, and inside that:

def default_parser(self, encoding):
    # This can either return a parser object or a class, which
    # will be instantiated with default arguments.
    if self._default_parser is not None:
        return self._default_parser
    return etree.XMLParser(
        target=self, strip_cdata=False, recover=True, encoding=encoding)

lxml's HTMLParser sets recover=True by default so it can deal with broken HTML; with XML you have to state explicitly that you want the parser to try to recover.
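
You can see the difference in defaults directly with lxml (a quick sketch):

from lxml import etree

broken = "<foo><bar>trololo<"

# The HTML parser recovers from broken markup by default,
# wrapping the fragment in <html><body>...</body></html>.
html_root = etree.fromstring(broken, parser=etree.HTMLParser())
print(etree.tostring(html_root))

# The XML parser is strict unless recover=True is set.
try:
    etree.fromstring(broken)
except etree.XMLSyntaxError as err:
    print("strict XML parse failed:", err)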