python - UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c

Question

I have a socket server that is supposed to receive UTF-8 valid characters from clients.

The problem is some clients (mainly hackers) are sending all the wrong kind of data over it.

I can easily distinguish the genuine client, but I am logging to files all the data sent so I can analyze it later.

Sometimes I get characters like this œ that cause the UnicodeDecodeError error.

I need to be able to make the string UTF-8 with or without those characters.

Update:

For my particular case the socket service was an MTA and thus I only expect to receive ASCII commands such as:

EHLO example.com
MAIL FROM: <[email protected]>
...

I was logging all of this in JSON.

Then some folks out there without good intentions decided to send all kind of junk.

That is why for my specific case it is perfectly OK to strip the non ASCII characters.

does the string come out of a file or a socket? could you please post code examples of how the string is encoded end decoded before it is send through the socket/filehandler? — devsnd
Did I write or didn't I write that the string comes over the socket? I simply read the string from the socket and with to put it in a dictionary and then JSON it to send it along. The JSON function failed due to those characters. — transilvlad

transilvlad transilvlad · Accepted Answer · 2012-09-17T23:05:11

str = unicode(str, errors='replace')

or

str = unicode(str, errors='ignore')

Note: This will strip out (ignore) the characters in question returning the string without them.

For me this is ideal case since I'm using it as protection against non-ASCII input which is not allowed by my application.

Alternatively: Use the open method from the codecs module to read in the file:

import codecs
with codecs.open(file_name, 'r', encoding='utf-8',
                 errors='ignore') as fdata: