I use lxml.html to parse various HTML pages. I've now noticed that, at least for some pages, it doesn't find the body tag even though the tag is present and Beautiful Soup finds it (despite Beautiful Soup using lxml as its parser).
Example page: https://plus.google.com/ (what remains of it)
import lxml.html
import bs4
html_string = """
... source code of https://plus.google.com/ (manually copied) ...
"""
# lxml fails (body is None)
body = lxml.html.fromstring(html_string).find('body')
# Beautiful Soup (using the lxml parser) succeeds
body = bs4.BeautifulSoup(html_string, 'lxml').find('body')
Any guess about what is happening here is welcome :)
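To narrow it down, this is the kind of diagnostic I would run (a minimal sketch; it assumes html_string holds the copied page source exactly as in the snippet above):

import lxml.html

# Inspect what lxml actually built instead of only calling find('body').
root = lxml.html.fromstring(html_string)
print(root.tag)                       # is the root really 'html'?
print([child.tag for child in root])  # which children did the parser keep?
# Serialize the tree back out to compare it with the original source.
print(lxml.html.tostring(root, pretty_print=True)[:500])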
Update:
The problem seems to be related to the encoding.
# working version
body = lxml.html.document_fromstring(html_string.encode('unicode-escape')).find('body')
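For comparison, here is a sketch of a variant I would expect to work as well, assuming the page is actually UTF-8 encoded (not verified against the real source): hand lxml bytes, or give the parser an explicit encoding, instead of a plain str.

import lxml.html

# Assumption: the page is UTF-8; adjust the encoding if the source says otherwise.
body = lxml.html.document_fromstring(html_string.encode('utf-8')).find('body')

# Or spell the encoding out on the parser itself.
parser = lxml.html.HTMLParser(encoding='utf-8')
body = lxml.html.document_fromstring(html_string.encode('utf-8'), parser=parser).find('body')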