28
votes

I'm learning urllib2 and Beautiful Soup, and my first tests produce errors like:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 10: ordinal not in range(128)
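For reference, this kind of error can be reproduced without urllib2 or Beautiful Soup. A minimal sketch, assuming the offending character is the horizontal ellipsis (U+2026) named in the message; the sample string is made up so that the character sits at position 10:

```python
# Hypothetical sample text with U+2026 (horizontal ellipsis) at index 10.
text = u"0123456789\u2026"

try:
    # Encoding to ASCII is what Python 2 does implicitly inside str(text).
    text.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(repr(error))
```

The explicit .encode("ascii") call is used here because it fails the same way on Python 2 and Python 3.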

There seem to be lots of posts about this type of error, and I have tried the solutions I can understand, but each seems to come with a catch-22. For example:

I want to print post.text (where text is the Beautiful Soup attribute that returns just the text). Both str(post.text) and post.text produce the Unicode errors (on characters like the right apostrophe ’ and the ellipsis …).

So I add post = unicode(post) above str(post.text), then I get:

AttributeError: 'unicode' object has no attribute 'text'

I also tried (post.text).encode() and (post.text).renderContents(). The latter produces the error:

AttributeError: 'unicode' object has no attribute 'renderContents'

and then I tried str(post.text).renderContents() and got the error:

AttributeError: 'str' object has no attribute 'renderContents'
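These AttributeErrors are consistent with the converted value being a plain string: once a tag is passed through unicode() or str(), the result no longer has tag attributes such as text. A sketch with a hypothetical stand-in class (FakeTag is made up for illustration and is not part of Beautiful Soup):

```python
class FakeTag(object):
    """Hypothetical stand-in for a parsed tag; not a Beautiful Soup class."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return "<p>%s</p>" % self.text


post = FakeTag(u"hello")
as_string = str(post)  # a plain string now; the tag object is left behind

has_text_before = hasattr(post, "text")
has_text_after = hasattr(as_string, "text")
print(has_text_before, has_text_after)
```

In other words, the conversion has to be applied to post.text at the point of output, not to post itself.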

It would be great if I could just declare somewhere at the top of the document "make this content interpretable" and still have access to the text attribute I need.


Update: after suggestions:

If I add post = post.decode("utf-8") above str(post.text) I get:

TypeError: unsupported operand type(s) for -: 'str' and 'int'  

If I add post = post.decode() above str(post.text) I get:

AttributeError: 'unicode' object has no attribute 'text'

If I add post = post.encode("utf-8") above (post.text) I get:

AttributeError: 'str' object has no attribute 'text'

I tried print post.text.encode('utf-8') and got:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 39: ordinal not in range(128)
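One possible explanation for this last error, offered as an assumption since the surrounding print statement is not shown: the encoded result is a byte string, and Python 2 implicitly decodes byte strings as ASCII when they are combined with unicode strings. The first UTF-8 byte of U+2026 is 0xe2, which matches the message. A sketch with made-up text:

```python
text = u"wait\u2026"            # hypothetical sample text
encoded = text.encode("utf-8")  # byte string; the ellipsis becomes 3 bytes

# bytearray indexing yields an int on both Python 2 and Python 3.
first_ellipsis_byte = bytearray(encoded)[4]

try:
    # Explicit version of the implicit ASCII decode that Python 2 performs
    # when a byte string is mixed with a unicode string.
    encoded.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(hex(first_ellipsis_byte), repr(error))
```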

For the sake of trying anything that might work, I also installed lxml for Windows and used it as the parser:

parsed_content = BeautifulSoup(original_content, "lxml")

according to http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters.

These steps didn't seem to make a difference.

I'm using Python 2.7.4 and Beautiful Soup 4.


Solution:

After getting a deeper understanding of Unicode, UTF-8 and the Beautiful Soup types, I found that the problem was in how I was printing. I removed all my str() calls and concatenations, e.g. str(something) + post.text + str(something_else), and changed them to something, post.text, something_else. It now seems to print well, except that I have less control over the formatting at this stage (e.g. a space is inserted at each comma).
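If the formatting control matters, one option (not something tried in the post above) is to build the whole line as a single unicode string instead of mixing str() with unicode text. A sketch in which the three values are hypothetical stand-ins for something, post.text and something_else:

```python
# Hypothetical stand-ins for something, post.text and something_else.
something = 3
post_text = u"It\u2019s done\u2026"
something_else = 4.5

# A unicode format string converts the numbers without str(), so no byte
# strings get mixed in and the separators are fully under control.
line = u"{0}: {1} ({2})".format(something, post_text, something_else)

# Encode once, at the point of output.
encoded_line = line.encode("utf-8")
print(repr(encoded_line))
```

The design choice is to keep everything as unicode internally and encode only at the boundary where the text leaves the program.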

3 Answers

46
votes

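In Python 2, a unicode object can only be printed if it can be encoded in the output encoding, which falls back to ASCII when Python cannot determine the encoding of the output stream; str() on a unicode object always uses ASCII. If a character can't be encoded, you'll get that error. You probably want to explicitly encode the text and then print the resulting str: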

print post.text.encode('utf-8')
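
If the console cannot display UTF-8 (an assumption about the environment, not something stated in the question), the errors argument of encode() offers some fallbacks. A sketch with made-up text:

```python
text = u"wait\u2026"  # hypothetical sample text

utf8_bytes = text.encode("utf-8")                    # lossless
replaced = text.encode("ascii", "replace")           # ellipsis becomes ?
escaped = text.encode("ascii", "xmlcharrefreplace")  # ellipsis becomes &#8230;
dropped = text.encode("ascii", "ignore")             # ellipsis is dropped

print(repr(replaced), repr(escaped), repr(dropped))
```

Only the UTF-8 version round-trips; the ASCII variants trade fidelity for output that any terminal can show.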

2
votes

    import urllib.request
    from bs4 import BeautifulSoup

    html = urllib.request.urlopen(THE_URL).read()
    soup = BeautifulSoup(html, "html.parser")
    print("'" + soup.encode("ascii").decode("ascii") + "'")

worked for me ;-)

0
votes

Did you try .decode() or .decode("utf-8")?

Also, I recommend using lxml with the html5lib parser:

http://lxml.de/html5parser.html