1
votes

I have a large HTML source file I would like to parse (~200,000 lines), and I'm fairly certain there is some poor formatting throughout. I've been researching parsers, and it seems Beautiful Soup, lxml, and html5lib are the most popular. From reading this website, it seems lxml is the most commonly used and the fastest, while Beautiful Soup is slower but accounts for more errors and variation.

I'm a little confused by the Beautiful Soup documentation, http://www.crummy.com/software/BeautifulSoup/bs4/doc/, and commands like BeautifulSoup(markup, "lxml") or BeautifulSoup(markup, "html5lib"). In such instances, is it using both Beautiful Soup and html5lib/lxml? Speed is not really an issue here, but accuracy is. The end goal is to get the source code using urllib2 and retrieve all the text data from the file, as if I had just copied and pasted the webpage.

P.S. Is there any way to parse the file without returning any whitespace that was not present in the webpage view?


1 Answer

4
votes

My understanding (having used BeautifulSoup for a handful of things) is that it is a wrapper around parsers like lxml or html5lib. Using whichever parser is specified (I believe the default is HTMLParser, the parser in Python's standard library), BeautifulSoup creates a tree of tag elements that makes it quite easy to navigate and search the HTML for useful data contained within tags. So yes, BeautifulSoup(markup, "lxml") uses both: lxml does the parsing, and BeautifulSoup gives you the tree to work with. If you really just need the text from the webpages and not more specific data from specific HTML tags, you might only need a code snippet similar to this:

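from bs4 import BeautifulSoup
import urllib2

# Fetch the page and parse it with Python's built-in parser
soup = BeautifulSoup(urllib2.urlopen("http://www.google.com"), "html.parser")
# Collect all the text in the document as a single string
text = soup.get_text()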

get_text isn't that great with complex webpages (it occasionally picks up stray JavaScript or CSS), but once you get the hang of BeautifulSoup, it shouldn't be hard to get only the text you want.
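One way to deal with the stray JavaScript/CSS, and with the extra whitespace asked about in the P.S., might look like the sketch below. It is only an illustration: the sample markup and the helper name visible_text are made up for this example, it is written for Python 3 and parses an inline string rather than fetching a page, and collapsing all whitespace to single spaces is just an approximation of what a browser displays.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for a downloaded page
html = """<html><head><title>Demo</title>
<style>p { color: red; }</style>
<script>var x = 1;</script></head>
<body><p>Hello,   <b>world</b>!</p>
<p>Second paragraph.</p></body></html>"""


def visible_text(markup, parser="html.parser"):
    """Return the text of `markup` without script/style content (illustrative helper)."""
    soup = BeautifulSoup(markup, parser)
    # Remove elements whose contents are never rendered as page text
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join text nodes with a space, then collapse every whitespace run to one space
    return " ".join(soup.get_text(separator=" ").split())


print(visible_text(html))
```

Note that joining with a space can leave a gap before punctuation that directly follows an inline tag, so some cleanup may still be needed depending on the page.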

For your purposes it seems like you don't need to worry about getting one of those other parsers to use with BeautifulSoup (html5lib or lxml). BeautifulSoup can deal with some sloppiness on its own, and if it can't, it will give an obvious error about "malformed HTML" or something of the sort, and that would be an indication to install html5lib or lxml.
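If you want to see how the parsers differ on sloppy markup before deciding, a rough comparison along these lines may help. This is a sketch under assumptions: the broken fragment and the helper name parse_with are invented for illustration, the code targets Python 3, and lxml and html5lib are reported as missing (None) if they are not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately malformed fragment: unclosed <p> tags and a stray </b>
broken = "<p>First<p>Second</b>"


def parse_with(markup, parser):
    """Return the repaired markup as a string, or None if the parser is not installed."""
    try:
        return str(BeautifulSoup(markup, parser))
    except FeatureNotFound:
        # Beautiful Soup raises this when the requested parser is unavailable
        return None


# The second argument only selects the underlying parser;
# the resulting tree is a BeautifulSoup object in every case.
for name in ("html.parser", "lxml", "html5lib"):
    print(name, "->", parse_with(broken, name))
```

Each parser repairs the fragment in its own way, so comparing the output on a sample of your real page is a reasonable way to judge which one is accurate enough.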