I'm trying to fetch the content of a web page, parse it, and then save it in a MySQL database.
I got this working for a page encoded in UTF-8.
But when I tried it with a page encoded in ISO-8859-9, I got an error.
My code to get the content of the page:
import urllib2
import chardet

def getcontent(url):
    opener = urllib2.build_opener()
    # Set both headers in one list; a second assignment would replace the first
    opener.addheaders = [('User-agent', 'Magic Browser'),
                         ('Accept-Charset', 'utf-8')]
    response = opener.open(url).read()
    # print chardet.detect(response).get('encoding')
    opener.close()
    return response

url = "http://www.meb.gov.tr/duyurular/index.asp?ID=4"
contentofpage = getcontent(url)
print contentofpage
print chardet.detect(contentofpage)
print contentofpage.encode("utf-8")
Output of print contentofpage: ... E�itim Teknolojileri Genel M�d�rl��� ...
Output of chardet.detect(contentofpage): {'confidence': 0.7789909202570836, 'encoding': 'ISO-8859-2'}
Traceback (most recent call last):
  File "meb.py", line 18, in <module>
    print contentofpage.encode("utf-8")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 458: ordinal not in range(128)
The page is actually a Turkish page and its encoding is ISO-8859-9.
When I print it with the default encoding, I see ��� in place of some characters. How can I read or convert the content of the page to UTF-8 or Turkish (ISO-8859-9)?
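A minimal sketch of the conversion being asked about, run on a hard-coded byte string rather than the downloaded page. It assumes the bytes really are ISO-8859-9; the helper name to_utf8 and the sample bytes are made up for illustration and are not part of the script above.

```python
def to_utf8(raw_bytes, source_encoding="iso-8859-9"):
    """Decode raw bytes with the page's charset, then re-encode as UTF-8.

    source_encoding is an assumption: it must match the encoding the
    server actually used, otherwise the decoded text will be wrong.
    """
    text = raw_bytes.decode(source_encoding)  # byte string -> unicode text
    return text.encode("utf-8")               # unicode text -> UTF-8 bytes


# Hypothetical sample: the word "Egitim" with a g-breve as its second
# letter, which ISO-8859-9 stores as the single byte 0xF0.
sample = b"E\xf0itim"
converted = to_utf8(sample)
```

Applied to contentofpage, the same helper would only give correct text if the page really is ISO-8859-9. chardet reported ISO-8859-2 above, so the encoding name passed in remains an assumption.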
Also, when I use unicode(contentofpage), I get:
Traceback (most recent call last):
  File "meb.py", line 20, in <module>
    print unicode(contentofpage)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xee in position 458: ordinal not in range(128)
Any help?