python libxml2 reader and XML_PARSE_RECOVER

Question

I'm trying to get a reader to recover from broken XML. Using the libxml2.XML_PARSE_RECOVER option with the DOM api (libxml2.readDoc) works and it recovers from entity problems.

However using the option with the reader API (which is essential due to the size of documents we are parsing) does not work. It just gets stuck in a perpetual loop (with reader.Read() returning -1):

Sample code (with small example):

import cStringIO
import libxml2

DOC = "<a>some broken & xml</a>"

reader = libxml2.readerForDoc(DOC, "urn:bogus", None, libxml2.XML_PARSE_RECOVER | libxml2.XML_PARSE_NOERROR)

ret = reader.Read()
while ret:
    print 'ret: %d' % ret
    print "node name: ", reader.Name(), reader.NodeType()
    ret = reader.Read()

Any ideas how to recover correctly?

Thanks but this doesn't recover, just aborts. So for the above I'd only get the <a> tag.The DOM api results in a document tree with the recovery just dropping the & - which is ideally what I'd like (equivalent) from the reader API. — bee

dcolish dcolish · Accepted Answer · 2010-10-29T21:18:15

I'm not too sure about the current state of the libxml2 bindings. Even the libxml2 site suggests using lxml instead. To parse this tree and ignore the & is nice and clean in lxml:

from cStringIO import StringIO
from lxml import etree

DOC = "<a>some broken & xml</a>"

reader = etree.XMLParser(recover=True)
tree = etree.parse(StringIO(DOC), reader)
print etree.tostring(tree.getroot())

The parsers page in the lxml docs goes into more detail about setting up a parser and iterating over the contents.

Edit:

If you want to parse a document incrementally the XMLparser class can be used as well since it is a subclass of _FeedParser:

DOC = "<a>some broken & xml</a>"
reader = etree.XMLParser(recover=True)

for data in StringIO(DOC).read():
    reader.feed(data)

tree = reader.close()
print etree.tostring(tree)

python libxml2 reader and XML_PARSE_RECOVER

4 Answers