9
votes

I have switched from Python 2.7 to Python 3.6.

I have scripts that deal with some non-English content.

I usually run scripts via Cron and also in Terminal.

I had a UnicodeDecodeError in my Python 2.7 scripts, and I solved it with this:

# encoding=utf8  
import sys  

reload(sys)  
sys.setdefaultencoding('utf8')

Now, in Python 3.6, it doesn't work. I have print statements like print("Here %s" % (myvar)) and they throw an error. I can work around the issue by replacing myvar with myvar.encode("utf-8"), but I don't want to do that in every print statement.

I set PYTHONIOENCODING=utf-8 in my terminal and I still have the issue.

Is there a cleaner way to solve the UnicodeDecodeError issue in Python 3.6?

Is there any way to tell Python 3 to print everything in UTF-8, just like I did in Python 2?

6
Are the non-English files encoded properly in UTF-8 themselves? - Edward Minnix
@EdwardMinnix I am scraping data from various Hebrew/Korean sites, so the data is not always clean. - Umair Ayub
@usr2564301 Is there any way to tell Python 3 to print everything in UTF-8, just like I did in Python 2? - Umair Ayub
Normally your terminal has an encoding defined, which Python uses to set the encoding of its file object (sys.stdout). Can you tell us what sys.stdout.encoding is set to on your machine? - Alfe
I think that is the root of the problem. What strange terminal are you using? In Unix-ish environments you can set the env var TERM to something like xterm or similar. The LANG variable could also have an influence. - Alfe

6 Answers

20
votes

It sounds like your locale is broken and that you have another bytes->Unicode issue. The thing you did for Python 2.7 is a hack that only masked the real problem (there's a reason why you have to reload sys to make it work).

To fix your locale, try typing locale from the command line. It should look something like:

LANG=en_GB.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_ALL=

locale depends on LANG being set properly. Python effectively uses the locale to work out which encoding to use when writing to stdout. If it can't work it out, it defaults to ASCII.
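To see what Python actually derived from the locale, a small check along these lines can help. This is only a sketch: describe_encodings is a hypothetical helper name, not something from the answer, and the values it reports depend entirely on your environment.

```python
import locale
import sys


def describe_encodings():
    """Collect the encodings Python derived from the environment."""
    return {
        # Encoding print() uses; None if stdout was replaced by an object
        # that does not carry an encoding.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Encoding reported by the locale machinery (driven by LANG / LC_* on Unix).
        "preferred": locale.getpreferredencoding(False),
        # Encoding used for file names.
        "filesystem": sys.getfilesystemencoding(),
    }


if __name__ == "__main__":
    for name, value in describe_encodings().items():
        print(name, "=", value)
```

Running it once from the terminal and once from the Cron job should show whether the two environments disagree.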

You should first attempt to fix your locale. If the locale command reports errors, make sure you've installed the correct language pack for your region.

If all else fails, you can always fix Python by setting PYTHONIOENCODING=UTF-8. This should be used as a last resort as you'll be masking problems once again.
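One way to confirm that the variable really reaches the interpreter is to start a child Python with it set and ask for its stdout encoding. The helper below is a hypothetical sketch (child_stdout_encoding is an illustrative name); it uses stdout=subprocess.PIPE so that it also runs on Python 3.6.

```python
import os
import subprocess
import sys


def child_stdout_encoding(extra_env):
    """Start a child interpreter with extra environment variables and
    return the sys.stdout.encoding it reports."""
    env = dict(os.environ)
    env.update(extra_env)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    # The encoding name itself is plain ASCII.
    return result.stdout.decode("ascii").strip()


if __name__ == "__main__":
    print(child_stdout_encoding({"PYTHONIOENCODING": "utf-8"}))
```

If a Cron job still reports something else, the variable is probably not being exported into the job's environment.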

If Python is still throwing an error after setting PYTHONIOENCODING then please update your question with the stacktrace. Chances are you've got an implied conversion going on.

3
votes

I had this issue when using Python inside a Docker container based on Ubuntu 18.04. It appeared to be a locale issue, which was solved by adding the following to the Dockerfile:

ENV LANG C.UTF-8

1
votes

For a Python-only solution you will have to recreate your sys.stdout object:

import sys, codecs
sys.stdout = codecs.getwriter('utf-8')(sys.stdout.detach())

After this, a normal print("hello world") should be encoded to UTF-8 automatically.

But you should try to find out why your terminal is set to such a strange encoding (which Python just tries to adapt to). Maybe your operating system is misconfigured somehow.
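The stream-wrapping idea above can also be sketched with io.TextIOWrapper, which is the class Python itself uses for sys.stdout. The example writes to an in-memory io.BytesIO instead of the real stdout so the result can be inspected; utf8_writer is a hypothetical helper name, and in a real script you would pass sys.stdout.detach() or sys.stdout.buffer instead.

```python
import io


def utf8_writer(binary_stream):
    """Wrap a binary stream in a text layer that always encodes as UTF-8."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", line_buffering=True)


# Stand-in for the terminal: collects the raw bytes that would be written.
buf = io.BytesIO()
out = utf8_writer(buf)

# print() hands str to the wrapper, which encodes it regardless of the locale.
print("안녕하세요", file=out)
out.flush()
```

The bytes collected in buf are the UTF-8 encoding of the text plus a line ending, whatever LANG happens to be.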

EDIT: In my tests unsetting the env variable LANG produced this strange setting for the stdout encoding for me:

LANG= python3
import sys
sys.stdout.encoding

printed 'ANSI_X3.4-1968'.

So I guess you might want to set your LANG to something like en_US.UTF-8. Your terminal program doesn't seem to do this.
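The odd-looking name ANSI_X3.4-1968 is just an alias for ASCII, which is why non-English text fails to print. A short check, using hypothetical helper names, makes this visible:

```python
import codecs


def canonical_codec_name(label):
    """Return the name Python uses internally for an encoding label."""
    return codecs.lookup(label).name


def can_encode(text, label):
    """Report whether text can be encoded with the given encoding."""
    try:
        text.encode(label)
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    print(canonical_codec_name("ANSI_X3.4-1968"))
    print(can_encode("안녕", "ANSI_X3.4-1968"))
    print(can_encode("안녕", "utf-8"))
```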

1
votes

To everyone using pickle to load a file previously saved in Python 2 and getting a UnicodeDecodeError: try setting the encoding parameter of pickle.load:

import pickle

with open("./data.pkl", "rb") as data_file:
    samples = pickle.load(data_file, encoding='latin1')
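As a rough illustration of what the encoding parameter changes, the sketch below loads a tiny hand-written pickle. PY2_PICKLE is an assumption standing in for a real Python 2 file: it is the protocol-0 pickle of the Python 2 byte string '\xe9'. load_py2_pickle is a hypothetical helper name.

```python
import pickle

# Hand-written protocol-0 pickle of the Python 2 str '\xe9'
# (illustrative stand-in for data saved by Python 2).
PY2_PICKLE = b"S'\\xe9'\np0\n."


def load_py2_pickle(data, encoding="latin1"):
    """Load Python 2 pickle data, decoding its 8-bit strings with `encoding`."""
    return pickle.loads(data, encoding=encoding)


if __name__ == "__main__":
    # latin1 maps every byte to one code point, so decoding cannot fail.
    print(repr(load_py2_pickle(PY2_PICKLE)))
    # encoding="bytes" keeps the raw bytes so you can decode them yourself.
    print(repr(load_py2_pickle(PY2_PICKLE, encoding="bytes")))
```

Because latin1 never fails, it can silently produce mojibake if the original strings were really UTF-8; in that case encoding="bytes" followed by an explicit decode is the safer route.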
-2
votes

Python 3 (including 3.6) already supports Unicode natively. Here is the documentation: https://docs.python.org/3/howto/unicode.html

So you don't need to force Unicode support the way you did in Python 2.7. Try to run your code normally. If you get an error when reading a Unicode text file, pass the encoding='utf-8' parameter to open() when reading the file.
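A minimal sketch of reading and writing with an explicit encoding, so the result no longer depends on the locale. The helper names and the temporary file are illustrative only.

```python
import os
import tempfile


def write_text(path, text, encoding="utf-8"):
    """Write text with an explicit encoding instead of the locale default."""
    with open(path, "w", encoding=encoding) as handle:
        handle.write(text)


def read_text(path, encoding="utf-8"):
    """Read text with an explicit encoding instead of the locale default."""
    with open(path, encoding=encoding) as handle:
        return handle.read()


def demo():
    """Round-trip a non-English string through a temporary file."""
    sample = "안녕하세요"
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        write_text(path, sample)
        return read_text(path) == sample
    finally:
        os.remove(path)
```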

-3
votes

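I mean you could write a custom function like this (not optimal, I know):

import sys

def printUTF8(text):
    # print(text.encode("utf-8")) would show the bytes repr (b'...') in Python 3,
    # so write the encoded bytes straight to the underlying binary stream.
    sys.stdout.buffer.write(text.encode("utf-8") + b"\n")
    sys.stdout.buffer.flush()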