0 votes

I use lxml.html to parse various HTML pages. I noticed that, at least for some pages, it does not find the body tag even though the tag is present and Beautiful Soup finds it (despite using lxml as its parser).

Example page: https://plus.google.com/ (what remains of it)

import lxml.html
import bs4

html_string = """
    ... source code of https://plus.google.com/ (manually copied) ...
"""

# lxml fails (body is None)
body = lxml.html.fromstring(html_string).find('body')

# Beautiful soup using lxml parser succeeds
body = bs4.BeautifulSoup(html_string, 'lxml').find('body')

Any guess about what is happening here is welcome :)

Update:

The problem seems to be related to the encoding.

# working version
body = lxml.html.document_fromstring(html_string.encode('unicode-escape')).find('body')
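Note that 'unicode-escape' does more than produce bytes: it rewrites newlines and non-ASCII characters as backslash escapes, so the parsed text no longer matches the page. A minimal sketch of a less lossy variant, assuming the copied source is really UTF-8 text; the html_string below is a made-up stand-in, not the real page, and this is not a confirmed diagnosis of the original failure:

```python
import lxml.html

# Hypothetical stand-in for the copied page source (not the real page).
html_string = (
    '<!DOCTYPE html><html><head><meta charset="utf-8">'
    "<title>demo</title></head><body><p>caf\u00e9</p></body></html>"
)

# 'unicode-escape' turns non-ASCII characters (and newlines) into
# backslash escapes, which changes the document content.
escaped = html_string.encode("unicode-escape")

# Encoding to UTF-8 gives lxml bytes without altering the characters.
utf8_bytes = html_string.encode("utf-8")
body = lxml.html.document_fromstring(utf8_bytes).find("body")
```

If the body is found with the UTF-8 bytes as well, that would suggest the difference lies in passing bytes (or in using document_fromstring) rather than in the escaping itself.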
What is your html_string? - Jack Fleeting
@JackFleeting It's the content of plus.google.com. I didn't add it as it is quite big. - Raphael

1 Answer

1 vote

You can use something like this:

import requests
import lxml.html

html_string = requests.get("https://plus.google.com/").content
body = lxml.html.document_fromstring(html_string).find('body')

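The body variable then contains the <body> element. Note that requests returns the page as bytes (.content), so lxml receives undecoded bytes rather than a Python str.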
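The code above also uses document_fromstring where the question used fromstring. The two behave differently on input that lxml does not treat as a full document: fromstring may return an inner element instead of the <html> root, and find('body') on that element returns None. A small sketch with a made-up fragment; whether this applies to the page in the question is an assumption, not something the thread confirms:

```python
import lxml.html

# Hypothetical input: a fragment rather than a full HTML document.
fragment = "<p>hello</p>"

# fromstring returns the single element it found (the <p> itself).
root = lxml.html.fromstring(fragment)

# document_fromstring always returns the <html> root, adding <body>.
doc = lxml.html.document_fromstring(fragment)

print(root.tag, root.find("body"))
print(doc.tag, doc.find("body") is not None)
```

Checking the tag of whatever fromstring returns is a quick way to see which case you are in.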