2 votes

I am doing a bit of web scraping with BeautifulSoup4 and am having problems decoding the response. The website returns a response whose header says:

content-type: text/html; charset=ISO-8859-1

So I normally decode it with the latin1 charset. But then, after decoding, there is a line in the HTML that says:

<meta content="text/html; charset=utf-8" http-equiv="content-type" />

And from this line on, the string is not decoded properly.

So what is the normal way to handle this? I would like to set an Accept-Charset line in the outgoing HTTP header, but couldn't find a way to do it. The other option is to decode line by line, searching for a new charset, but I would prefer to simply accept only UTF-8.

I am using Python 3 and the http.client library.

EDIT1: Code:

import http.client as cl
from bs4 import BeautifulSoup

conn = cl.HTTPConnection('www.amazon.com')
conn.request("GET", '/A-Man-For-All-Seasons/dp/B003TQ1IW6/ref=sr_1_109?s=instant-video&ie=UTF8&qid=1348337540&sr=1-109')
response = conn.getresponse()
content = response.read()  # raw bytes

soup = BeautifulSoup(content)
with open('am.html', 'w', encoding='utf-8') as f:
    f.write(soup.prettify())

# I am actually doing this with httplib2, but the result is the same

EDIT2: It looks like something really is wrong with the configuration of Beautiful Soup 4 on Linux, or it's a bug. The following works, but I cannot parse the response with BS4:

import httplib2

# movieLink is the Amazon URL from EDIT1
movieLink = 'http://www.amazon.com/A-Man-For-All-Seasons/dp/B003TQ1IW6/ref=sr_1_109?s=instant-video&ie=UTF8&qid=1348337540&sr=1-109'
h = httplib2.Http('.cache')
response, content = h.request(movieLink, headers={'accept-charset': 'latin1'})
content = content.decode('latin-1')

Thank you, Blckknght.

Your code works for me (on Windows with Python 3.2.3 and Beautiful Soup 4.1.3). I don't get anything messed up in the output file. – Blckknght
Any reason for not just using Amazon's API? Oh, and welcome to SO! – vzwick
Setting the correct request header and encoding by default can solve this: datascraping.co/doc/questions/21/… – Vikash Rathee

2 Answers

4 votes

Reading through the Beautiful Soup documentation, it looks like there are two decent approaches.

  1. The best solution is probably not to decode the HTML document yourself, and instead give the raw byte string to Beautiful Soup. It will figure out the right encoding and decode the document automatically (using its included Unicode, Dammit library). It will find and interpret the relevant HTML meta tag if there is one, or analyze the document's contents and make a guess. This should certainly solve your immediate case, and even for documents without meta tags it will probably get it right most of the time (see the first sketch after this list). Scanning the document may be a bit slow, though, so if performance is a significant issue you might prefer the next option.

  2. The next best solution may be to apply your own knowledge to the issue. If the page you're scraping is always encoded as UTF-8, you can simply decode it that way every time, regardless of what the server says (see the second sketch below). This of course depends on the page encoding being consistent, which may or may not be the case (e.g. a website with some UTF-8 pages and some Latin-1 pages). If you're only scraping a single page (or a single type of page on a dynamic site), you're likely to find the same encoding every time, so this can work well. The virtue of this approach is its simplicity (and to a lesser extent, speed), but it comes at the cost of flexibility and robustness: your script is likely to break if the site changes the encoding it uses.
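
A minimal sketch of the first approach, reusing content (the raw response bytes) from the code in your EDIT1:

from bs4 import BeautifulSoup

# Hand Beautiful Soup the raw bytes, not a decoded str; its
# Unicode, Dammit machinery reads the meta tag and decodes for you.
soup = BeautifulSoup(content)
print(soup.original_encoding)  # the encoding Unicode, Dammit settled on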
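
And a sketch of the second approach, assuming you already know the pages are always UTF-8:

# Ignore the server's ISO-8859-1 header and decode with the
# encoding you know to be right, then pass Beautiful Soup the str.
html = content.decode('utf-8')
soup = BeautifulSoup(html)

If you'd rather keep passing bytes, Beautiful Soup also accepts an encoding hint: BeautifulSoup(content, from_encoding='utf-8').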

0 votes

This might be a duplicate of BeautifulSoup not reading documents correctly, i.e. it may have been caused by a bug in BS 4.0.2.

That bug has been fixed in 4.0.3. You might want to check the output of

>>> import bs4
>>> bs4.__version__

If it's 4.0.2, upgrade BeautifulSoup to a later version.
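
For example, with pip (assuming pip is set up for your Python 3 interpreter):

pip install --upgrade beautifulsoup4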