I am doing a bit of web scraping with BeautifulSoup4 and am having problems decoding the response. The website returns a header that says:
content-type: text/html; charset=ISO-8859-1
So normally I would decode it with the latin1 charset. But after decoding, there is a line in the HTML that says:
<meta content="text/html; charset=utf-8" http-equiv="content-type" />
From this line on, the string is not decoded properly.
So what is the normal way to handle this? I would like to set an accept-encoding line in the outgoing HTTP header, but couldn't find a way to do it. The other option is to decode line by line, searching for a new charset, but I would prefer to do it simply by accepting only utf-8.
I use Python 3 and the http.client library.
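For reference, a minimal sketch of how a request header can be passed with http.client and how the charset can be read back from the response header. The helper names `charset_from_content_type` and `fetch` are made up for illustration, and the sketch sends `Accept-Charset` as the header that expresses a charset preference; servers are free to ignore it, so this is not guaranteed to change what comes back.

```python
import http.client
from email.message import Message


def charset_from_content_type(value, default="utf-8"):
    """Return the charset parameter of a Content-Type header value.

    Hypothetical helper: falls back to `default` when no charset is given.
    """
    if not value:
        return default
    msg = Message()
    msg["content-type"] = value
    return msg.get_content_charset() or default


def fetch(host, path, accept_charset="utf-8"):
    """GET a page and decode it with the charset named in the response header.

    Hypothetical helper: the server may ignore the Accept-Charset preference.
    """
    conn = http.client.HTTPConnection(host)
    try:
        # Extra request headers go in the `headers` dict.
        conn.request("GET", path, headers={"Accept-Charset": accept_charset})
        response = conn.getresponse()
        raw = response.read()
        charset = charset_from_content_type(response.getheader("Content-Type", ""))
    finally:
        conn.close()
    # Replace undecodable bytes instead of raising.
    return raw.decode(charset, errors="replace")
```

Calling `charset_from_content_type('text/html; charset=ISO-8859-1')` gives the lowercased name from the header shown above, which is the charset `fetch` would then decode with.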
EDIT1: Code:
import http.client as cl
from bs4 import BeautifulSoup
conn = cl.HTTPConnection('www.amazon.com')
conn.request("GET", '/A-Man-For-All-Seasons/dp/B003TQ1IW6/ref=sr_1_109?s=instant-video&ie=UTF8&qid=1348337540&sr=1-109')
response = conn.getresponse()
content = response.read()
soup = BeautifulSoup(content)
# Write the prettified HTML and close the file when done
with open('am.html', 'w') as f:
    f.write(soup.prettify())
# I am actually doing this with httplib2, but the result is the same
EDIT2: It looks like something really is wrong with the configuration of Beautiful Soup 4 on Linux, or it's a bug. This works, but I cannot parse the response with BS4:
import httplib2
h = httplib2.Http('.cache')
response, content = h.request(movieLink, headers={'accept-charset': 'latin1'})
content = content.decode('latin-1')
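When the header and the meta tag disagree, as they do here, one possible approach is to try the charset declared in the HTML first and fall back to the one from the header. The following is only a sketch under that assumption: `decode_html` and `META_CHARSET` are illustrative names, and the regular expression is a rough approximation, not a full implementation of how browsers sniff encodings.

```python
import re

# Rough pattern for charset=... inside a <meta> tag (illustrative, not exhaustive).
META_CHARSET = re.compile(
    rb'<meta[^>]+charset=["\']?([A-Za-z0-9_\-]+)', re.IGNORECASE
)


def decode_html(raw, header_charset="latin-1"):
    """Decode raw HTML bytes, preferring the charset declared in a meta tag.

    Hypothetical helper: tries the meta charset, then the header charset,
    and finally latin-1, which can decode any byte sequence.
    """
    candidates = []
    match = META_CHARSET.search(raw)
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.append(header_charset)
    for name in candidates:
        try:
            return raw.decode(name)
        except (LookupError, UnicodeDecodeError):
            # Unknown codec name or bytes invalid for it: try the next one.
            continue
    return raw.decode("latin-1")
```

With the httplib2 snippet above, this would be used as `text = decode_html(content)` in place of the fixed `content.decode('latin-1')`.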
Thank you, Blckknght.