8
votes

I'm trying to get a reader to recover from broken XML. Using the libxml2.XML_PARSE_RECOVER option with the DOM api (libxml2.readDoc) works and it recovers from entity problems.

However using the option with the reader API (which is essential due to the size of documents we are parsing) does not work. It just gets stuck in a perpetual loop (with reader.Read() returning -1):

Sample code (with small example):

import cStringIO
import libxml2

DOC = "<a>some broken & xml</a>"

reader = libxml2.readerForDoc(DOC, "urn:bogus", None, libxml2.XML_PARSE_RECOVER | libxml2.XML_PARSE_NOERROR)

ret = reader.Read()
while ret:
    print 'ret: %d' % ret
    print "node name: ", reader.Name(), reader.NodeType()
    ret = reader.Read()

Any ideas how to recover correctly?

4
Yes: while ret == 1:. See xmlsoft.org/xmlreader.html.Bernd Petersohn
Thanks but this doesn't recover, just aborts. So for the above I'd only get the <a> tag.The DOM api results in a document tree with the recovery just dropping the & - which is ideally what I'd like (equivalent) from the reader API.bee

4 Answers

1
votes

I'm not too sure about the current state of the libxml2 bindings. Even the libxml2 site suggests using lxml instead. To parse this tree and ignore the & is nice and clean in lxml:

from cStringIO import StringIO
from lxml import etree

DOC = "<a>some broken & xml</a>"

reader = etree.XMLParser(recover=True)
tree = etree.parse(StringIO(DOC), reader)
print etree.tostring(tree.getroot())

The parsers page in the lxml docs goes into more detail about setting up a parser and iterating over the contents.

Edit:

If you want to parse a document incrementally the XMLparser class can be used as well since it is a subclass of _FeedParser:

DOC = "<a>some broken & xml</a>"
reader = etree.XMLParser(recover=True)

for data in StringIO(DOC).read():
    reader.feed(data)

tree = reader.close()
print etree.tostring(tree)
0
votes

Isn't the xml broken in some consistent way? Isn't there some pattern you could follow to repair your xml before parsing?

For example - if the error is caused only by unescaped ampersands and you don't use CDATA or processing instructions, it can be repaired with a regexp.

EDIT: Then take a look at sgmllib in python standard library. BeautifulSoup uses it, so it can be useful in your case. (BeatifulSoup itself offers only the tree representation, not the events).

0
votes

Consider using xml.sax. When I'm presented really malformed XML that can have a plethora of different problems try dividing the problem into small pieces.

You mentioned that you have a very large XML file, well it probably has many records that you process serially. And each record (e.g. <item>...</item> has a start and end tag, presumably - these will will your recovery points.

In xml.sax you provide the reader, the handler, and the input sources. At worse a single records will be unrecoverable with this technique. Its a little more setup, but incrementally parsing a malformed feed a record at a time logging the bad records is probably the best you can do.

In the logs make sure to give yourself enough information to rebuild the original record so you can add additional recovery code for all the cases that you'll no doubt have to handle (e.g. create a badrecords_today's date.xml so you can reprocess manually).

Good luck.

0
votes

Or, you could use BeautifulSoup. It does a nice job recovering broken ML.