I have a large HTML source file I would like to parse (~200,000 lines), and I'm fairly certain there is some poor formatting throughout. I've been researching parsers, and it seems Beautiful Soup, lxml, and html5lib are the most popular. From reading this website, it seems lxml is the most commonly used and fastest, while Beautiful Soup is slower but tolerates more errors and variation.
I'm a little confused by the Beautiful Soup documentation, http://www.crummy.com/software/BeautifulSoup/bs4/doc/, and calls like BeautifulSoup(markup, "lxml") or BeautifulSoup(markup, "html5lib"). In such cases is it using both Beautiful Soup and lxml/html5lib? Speed is not really an issue here, but accuracy is. The end goal is to fetch the source code using urllib2, and retrieve all the text data from the file as if I were to just copy/paste the webpage.
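To show what I mean, here is a rough sketch of what I'm attempting (assuming bs4 and lxml are installed; the markup string here is just a stand-in for what urllib2 would return):

```python
from bs4 import BeautifulSoup

# In practice this would come from urllib2.urlopen(url).read();
# a small malformed snippet stands in for the real ~200,000-line page.
markup = "<html><body><p>Hello<p>world</body>"

# Beautiful Soup supplies the API; "lxml" names the underlying parser,
# so (as I understand it) both libraries are in play here.
soup = BeautifulSoup(markup, "lxml")

# Extract just the visible text, joined with spaces and stripped.
print(soup.get_text(" ", strip=True))  # → "Hello world"
```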
P.S. Is there any way to parse the file without returning whitespace that was not present in the webpage view?