I'm working on large projects which require fast HTML parsing, including recovery for broken HTML pages.
Currently lxml is my choice. I know it also provides an interface to libxml2's recovery mode, but I'm not really happy with the results. For some specific HTML pages I found that BeautifulSoup produces noticeably better results (example: http://fortune.com/2015/11/10/vw-scandal-volkswagen-gift-cards/, which has a broken <header> tag that lxml/libxml2 couldn't correct). However, the problem is that BS is extremely slow.
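For anyone who wants to reproduce this kind of comparison, here is a minimal sketch. The sample markup is hypothetical (an unclosed <header>, not the actual markup of the page linked above), and the helper names `TagCollector`, `available_parsers` and `compare` are made up for illustration. lxml and BeautifulSoup are treated as optional, so the script still runs with only the standard library installed. BeautifulSoup is given the stdlib "html.parser" backend here, which is an assumption; results may differ with other backends.

```python
import time
from html.parser import HTMLParser

# Hypothetical sample: a <header> that is never closed. This is NOT the
# markup of the real page, only a stand-in for that kind of breakage.
BROKEN_HTML = "<html><body><header><h1>Title</h1><p>Body text</p></body></html>"


class TagCollector(HTMLParser):
    """Standard-library baseline: records start tags in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parse_stdlib(html):
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def available_parsers():
    """Map a label to a callable returning the tag names that parser found."""
    parsers = {"html.parser": parse_stdlib}
    try:
        from lxml import etree

        def parse_lxml(html):
            # recover=True asks libxml2 to repair what it can instead of failing.
            root = etree.fromstring(html, etree.HTMLParser(recover=True))
            # Comments and processing instructions have non-string tags; skip them.
            return [el.tag for el in root.iter() if isinstance(el.tag, str)]

        parsers["lxml (recover)"] = parse_lxml
    except ImportError:
        pass  # lxml not installed; leave it out of the comparison
    try:
        from bs4 import BeautifulSoup

        def parse_bs4(html):
            # Assumption: use the stdlib backend so no extra parser is required.
            soup = BeautifulSoup(html, "html.parser")
            return [el.name for el in soup.find_all(True)]

        parsers["BeautifulSoup"] = parse_bs4
    except ImportError:
        pass  # bs4 not installed; leave it out of the comparison
    return parsers


def compare(html, repeat=20):
    """Return {label: (tags, best_seconds)} for every installed parser."""
    report = {}
    for label, parse in available_parsers().items():
        best = float("inf")
        tags = []
        for _ in range(repeat):
            start = time.perf_counter()
            tags = parse(html)
            best = min(best, time.perf_counter() - start)
        report[label] = (tags, best)
    return report


if __name__ == "__main__":
    for label, (tags, seconds) in compare(BROKEN_HTML).items():
        print(f"{label:16s} {seconds * 1e6:10.1f} us  {tags}")
```

A snippet this small only shows how to put the parsers side by side. To see real differences in speed and recovery, `compare` would need to be fed the full source of a problematic page.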
As far as I can see, modern browsers like Chrome and Firefox parse HTML very quickly and handle broken HTML really well. As far as I know, Chromium bundles libxml2 and libxslt just as lxml does, but it evidently handles broken HTML with a more effective algorithm. I was hoping there would be a standalone repo exported from Chromium that I could use, but I haven't found anything like that yet.
Does anyone know a good library, or at least a workaround (for example, combining parts of existing parsers)? Thanks a lot!