I'm learning about urllib2 and Beautiful Soup, and in my first tests I am getting errors like:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2026' in position 10: ordinal not in range(128)
There seem to be lots of posts about this type of error, and I have tried the solutions that I can understand, but each seems to come with a catch-22, e.g.:
I want to print post.text (where text is a Beautiful Soup attribute that just returns the text). Both str(post.text) and post.text produce the Unicode errors (on characters like the right apostrophe ’ and the ellipsis …).
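As a minimal sketch of why these characters trip the error: in Python 2, str() on a unicode value encodes it with the ASCII codec, which cannot represent U+2026. The snippet below does that encoding explicitly so it behaves the same on Python 2 and 3. The sample string and the helper name try_ascii are assumptions for illustration, not taken from the original page.

```python
# Stand-in for post.text; U+2026 is the ellipsis character from the traceback.
text = u"Read more\u2026"


def try_ascii(value):
    """Return the ASCII-encoded bytes, or the error message if encoding fails."""
    try:
        return value.encode("ascii")
    except UnicodeEncodeError as exc:
        return str(exc)


result = try_ascii(text)            # fails: U+2026 is outside the ASCII range
utf8_bytes = text.encode("utf-8")   # succeeds: UTF-8 can represent it
```

The ASCII attempt fails with the same "ordinal not in range(128)" complaint, while the UTF-8 encoding goes through.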
So I add post = unicode(post) above str(post.text), and then I get:
AttributeError: 'unicode' object has no attribute 'text'
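This AttributeError is what one would expect once post is rebound to a plain string: the string is no longer a parsed element, so it has no text attribute. A rough sketch, using a hypothetical FakeTag class as a stand-in (it is not Beautiful Soup's real class), and str() in place of Python 2's unicode() so that it runs on either version:

```python
class FakeTag(object):
    """Hypothetical stand-in for a parsed element (not the real bs4 class)."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return "<p>...</p>"


post = FakeTag(u"It\u2019s here\u2026")
has_text_before = hasattr(post, "text")      # the element has .text

converted = str(post)                        # like post = unicode(post) above
has_text_after = hasattr(converted, "text")  # a plain string does not
```

Under this reading, any conversion would need to be applied to post.text itself rather than to post.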
I also tried (post.text).encode() and (post.text).renderContents(), the latter producing the error:
AttributeError: 'unicode' object has no attribute 'renderContents'
Then I tried str(post.text).renderContents() and got the error:
AttributeError: 'str' object has no attribute 'renderContents'
It would be great if I could just declare somewhere at the top of the document "make this content interpretable" and still have access to the text attribute I need.
Update: after suggestions:
If I add post = post.decode("utf-8") above str(post.text), I get:
TypeError: unsupported operand type(s) for -: 'str' and 'int'
If I add post = post.decode() above str(post.text), I get:
AttributeError: 'unicode' object has no attribute 'text'
If I add post = post.encode("utf-8") above (post.text), I get:
AttributeError: 'str' object has no attribute 'text'
I tried print post.text.encode('utf-8') and got:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 39: ordinal not in range(128)
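One plausible route to this UnicodeDecodeError, assuming the encoded bytes were combined with other text on the same line: in Python 2, mixing a byte string with a unicode string makes the interpreter decode the bytes as ASCII first. The sketch below performs that decode explicitly so it runs on Python 2 and 3. The sample string and the helper name decode_as_ascii are illustrative assumptions.

```python
# Byte string of the kind post.text.encode('utf-8') would produce.
encoded = u"It\u2019s here\u2026".encode("utf-8")

# The right apostrophe U+2019 becomes three bytes, the first being 0xe2.
first_non_ascii = bytearray(encoded)[2]


def decode_as_ascii(data):
    """Decode bytes as ASCII, returning the error message on failure."""
    try:
        return data.decode("ascii")
    except UnicodeDecodeError as exc:
        return str(exc)


message = decode_as_ascii(encoded)  # fails on the 0xe2 byte
```

The byte 0xe2 named in the traceback is the lead byte of the UTF-8 encoding for both the right apostrophe and the ellipsis, which fits this explanation.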
And for the sake of trying things that might work, I installed lxml for Windows from here and used it as the parser with:
parsed_content = BeautifulSoup(original_content, "lxml")
according to http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters.
These steps didn't seem to make a difference.
I'm using Python 2.7.4 and Beautiful Soup 4.
Solution:
After getting a deeper understanding of Unicode, UTF-8 and Beautiful Soup types, I found the problem had to do with how I was printing. I removed all my str() calls and concatenations, e.g. str(something) + post.text + str(something_else), and passed the values to print separated by commas instead: something, post.text, something_else. It now seems to print well, except that I have less control over the formatting at this stage (e.g. a space is inserted at each comma).
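One possible way to get the formatting control back, sketched under the assumption that all pieces can be kept as text until the last moment: build the line with a unicode format string, then encode it once at the point of output. The function name build_line and the sample values are hypothetical.

```python
def build_line(prefix, text, suffix):
    """Join the pieces as text, choosing the separators explicitly."""
    return u"{0}: {1} [{2}]".format(prefix, text, suffix)


# Non-string values such as integers are formatted without calling str().
line = build_line(42, u"It\u2019s here\u2026", u"end")

# Encode once, only where bytes are actually needed (file, pipe, console).
output_bytes = line.encode("utf-8")
```

Whether the encoded bytes display correctly still depends on the encoding of the console or file they are written to.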