0
votes

I have a bunch of HTML files that I downloaded using the httplib2 package in Python. The non-breaking spaces ('&nbsp;') are showing up as 'Â '.

<font color="#ff0000">02/12/2004Â </font> is showing while <font color="#ff0000">02/12/2004&nbsp;</font> is the desired format.

How do I replace the 'Â ' with '&nbsp;' in Python? Thanks a lot!

3
Yes, it is slightly different from the original HTML. I am using httplib2 to download the pages, not a real browser. Is there something I have to include in the headers for httplib2 to download the page as is? - ThinkCode

3 Answers

1
votes

You've got an encoding problem. Instead of trying to remove these characters, find out the page's encoding; then, when you read the file, use the codecs module instead of open() and pass the proper character encoding.
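To illustrate where the 'Â' comes from, here is a minimal sketch. It assumes the pages are UTF-8 (check the page's Content-Type header or meta tag to confirm) and uses a temporary file as a stand-in for a downloaded page. `io.open` is used here as an equivalent of the codecs approach; both accept an explicit encoding.

```python
import io
import os
import tempfile

# The non-breaking space U+00A0 is two bytes in UTF-8: 0xC2 0xA0.
raw = u"02/12/2004\u00a0".encode("utf-8")

# Decoding those bytes as Latin-1 turns 0xC2 into a stray "Â" before the space.
garbled = raw.decode("latin-1")

# Write the bytes to a temporary file standing in for a downloaded page.
fd, path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "wb") as f:
    f.write(raw)

# Read the file back, stating the encoding explicitly (assumed to be UTF-8).
with io.open(path, encoding="utf-8") as f:
    text = f.read()
os.remove(path)

# With the text decoded correctly, the non-breaking space can be replaced safely.
fixed = text.replace(u"\u00a0", u"&nbsp;")
```

If the files really are in another encoding, pass that name instead of "utf-8".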

0
votes
import string
filtered_content = ''.join(filter(lambda x: x in string.printable, content))

This solved my problem. Thank you!

-1
votes
s = s.replace('Â ', '&nbsp;')

However, although I haven't used httplib2, I'm pretty sure something is wrong if the HTML source changes when you download it. There may be a decoding problem. What version of Python are you using? If it's Python 3, the content will be a byte sequence, not a string, so you'll have to specify the right encoding to decode the bytes.

http://code.google.com/p/httplib2/wiki/ExamplesPython3
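As a rough sketch of the decoding step described above: the helpers `charset_from_content_type` and `decode_body` are hypothetical names, and the byte string stands in for a downloaded body, so no request is made. The UTF-8 fallback is an assumption; a page may declare its encoding only in a meta tag.

```python
from email.message import Message


def charset_from_content_type(value, default="utf-8"):
    """Return the charset named in a Content-Type header value, or `default`."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset() or default


def decode_body(content, content_type):
    """Decode a response body (bytes) using the charset from its Content-Type."""
    return content.decode(charset_from_content_type(content_type))


# Stand-in for the bytes a download would return; no request is made here.
content = u"02/12/2004\u00a0".encode("utf-8")
html = decode_body(content, "text/html; charset=UTF-8")

# Once decoded, the non-breaking space is a single character and easy to replace.
html = html.replace(u"\u00a0", u"&nbsp;")
```

With httplib2, the body and the response headers returned by a request would take the place of `content` and the Content-Type string used here.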

EDIT: If you aren't limited to using just httplib2, perhaps you could try looking into using the urllib, urllib2, or httplib modules that are part of the Python 2.6 standard library?