I'm using BeautifulSoup to read a site that is written in Hebrew and encoded in windows-1255, according to this line:

<meta http-EQUIV="Content-Type" Content="text/html; charset=windows-1255">

When I try to encode it, I get the following error:

> UnicodeEncodeError: 'charmap' codec can't encode characters in position 6949-6950: character maps to <undefined>

The code:

from bs4 import BeautifulSoup
import requests

r = requests.get('http://www.plonter.co.il')
soup = BeautifulSoup(r.text)
print soup.prettify().encode('windows-1255') 

1 Answer

If the site is already encoded in windows-1255, you should decode it to get unicode, or just use it in its current encoding.

Edit: I didn't know that r.text was already decoded.

>>> import requests
>>> r = requests.get('http://www.plonter.co.il')
>>> isinstance(r.text, unicode)
True
>>> isinstance(r.content, unicode)
False
>>> isinstance(r.content, str)
True
>>> r.encoding
'ISO-8859-1'
>>> r.content.decode(r.encoding).encode('utf-8')  # works
>>> r.content.decode(r.encoding).encode('windows-1255') # fails
>>> r.content.decode(r.encoding).encode('windows-1255', 'ignore')  # works
>>> r.content.decode(r.encoding).encode('windows-1252') # works

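So I think you got the encoding "wrong". r.text was decoded as ISO-8859-1 (the value of r.encoding above), so it contains Latin characters that 'windows-1255' has no mapping for, and encoding it to 'windows-1255' fails. On the other hand, 'utf-8', 'iso-8859-1' and 'windows-1252' seem to be able to handle it.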
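The behaviour in the session above can be reproduced offline. This is only a minimal sketch, written for Python 3: the Hebrew sample string and the `can_encode` helper are made up for illustration and are not taken from the site. It encodes Hebrew text as windows-1255, decodes the bytes as ISO-8859-1 the way `r.text` was decoded, and then checks which codecs can encode the result.

```python
# Hypothetical sample text (the Hebrew word "shalom"); any Hebrew string would do.
original = u"\u05e9\u05dc\u05d5\u05dd"

# The bytes a windows-1255 page would contain for that text.
raw = original.encode("windows-1255")

# Decoding as ISO-8859-1 never raises, because every byte maps to some
# code point, but the result is accented Latin letters, not Hebrew.
misdecoded = raw.decode("iso-8859-1")


def can_encode(text, codec):
    """Return True if text can be encoded with codec without errors."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Try the same four target codecs as in the session above.
results = {
    codec: can_encode(misdecoded, codec)
    for codec in ("utf-8", "iso-8859-1", "windows-1252", "windows-1255")
}

# Decoding the raw bytes with the codec the page declares gives the text back.
recovered = raw.decode("windows-1255")

print(results)
print(recovered == original)
```

Under these assumptions only the 'windows-1255' entry should be False, which matches the pattern of "works" and "fails" above.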

>>> r.content.decode(r.encoding) == r.text
True
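Since r.text is just r.content decoded with r.encoding, one possible way around the problem is to decode the raw bytes yourself using the charset the page declares in its meta tag. The sketch below is an assumption-laden illustration in Python 3, not the only approach: `sniff_meta_charset` and `decode_page` are hypothetical helper names, the regular expression is deliberately simplistic, and the sample bytes stand in for `r.content`.

```python
import re

# Simplistic pattern for charset=... inside a meta tag (an assumption; real
# HTML can declare its encoding in other ways as well).
META_CHARSET = re.compile(br"charset=[\"']?([\w-]+)", re.IGNORECASE)


def sniff_meta_charset(raw, default="utf-8"):
    """Return the charset named near the top of the page, or default."""
    match = META_CHARSET.search(raw[:2048])
    return match.group(1).decode("ascii") if match else default


def decode_page(raw):
    """Decode raw page bytes using the charset declared in the page itself."""
    # errors="replace" keeps going if a byte has no mapping in that codec.
    return raw.decode(sniff_meta_charset(raw), errors="replace")


# Stand-in for r.content: the meta line from the question plus Hebrew text.
sample = (
    b'<meta http-EQUIV="Content-Type" Content="text/html; charset=windows-1255">'
    + u"\u05e9\u05dc\u05d5\u05dd".encode("windows-1255")
)

text = decode_page(sample)
print(sniff_meta_charset(sample))
```

With requests, a shorter route that should have the same effect is to set `r.encoding = 'windows-1255'` before reading `r.text`, or to pass `r.content` to BeautifulSoup so that it can look at the meta tag itself.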