0
votes

I'm reading a text file that has unicode characters from many different countries. The data in the file is also in JSON format.

I'm working on a CentOS machine. When I open the file in a terminal, the unicode characters display just fine (so my terminal is configured for unicode).

When I test my code in Eclipse, it works fine. When I run my code in the terminal, it throws an error: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 17: ordinal not in range(128)

for line in open("data-01083"):
    try:
        tmp = line
        if tmp == "":
            break
        theData = json.loads(tmp[41:]) 

        for loc in theData["locList"]:
            outLine = tmp[:40] 
            outLine = outLine + delim + theData["names"][0]["name"]
            outLine = outLine + delim + str(theData.get("Flagvalue"))
            outLine = outLine + delim + str(loc.get("myType"))
            flatAdd = ""
            srcAddr = loc.get("Address")
            if srcAddr != None:
                flatAdd = delim + str(srcAddr.get("houseNumber"))
                flatAdd = flatAdd + delim + str(srcAddr.get("streetName"))
                flatAdd = flatAdd + delim + str(srcAddr.get("postalCode"))
                flatAdd = flatAdd + delim + str(srcAddr.get("CountryCode"))
            else:
                flatAdd = delim + "None" + delim + "None" + delim + "None" + delim + "None" + delim + "None"

            outLine = outLine + flatAdd

            sys.stdout.write(("%s\n" % (outLine)).encode('utf-8'))
    except:
        sys.stdout.write("Error Processing record\n")

So everything works until it gets to streetName, which is where the non-ASCII characters start showing up. There it crashes with the UnicodeDecodeError.

I can fix that instance by adding .encode('utf-8'):

 flatAdd = flatAdd + delim + str(srcAddr.get("streetName").encode('utf-8'))

but then it crashes with the UnicodeDecodeError on the next line:

outLine = outLine + flatAdd

I have been stumbling through these types of issues for a month. Any feedback would be greatly appreciated!!

2
Robᵩ, thank you!!! I feel like Neo after he sees the bytes. - user1826936

2 Answers

1
votes

This might fix your problem. I'm saying might because encoding sometimes makes weird stuff happen ;)

#!/usr/bin/python
# -*- coding: utf-8 -*-

text_file_utf8 = text_file.encode('utf8')

From this point on you should be rid of the messages. If not, please tell us what kind of file you have and what language the text is in, and maybe post some of the file's header data.

text_file.decode("ISO-8859-1") might also be a solution.
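
Which of the two decode calls is right depends on how the file was actually written. As a rough illustration (the byte string below is made up for the example, not taken from the asker's data), here is how the same two bytes behave under each codec, and how the 'ascii' codec produces the kind of error quoted in the question:

```python
# The two bytes below are the UTF-8 encoding of U+00E9 (e with an acute accent).
raw = b"\xc3\xa9"

as_utf8 = raw.decode("utf-8")          # one character: U+00E9
as_latin1 = raw.decode("iso-8859-1")   # two characters: U+00C3, U+00A9 (mojibake)

# Decoding the same bytes as ASCII fails on the first byte, 0xc3.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = str(exc)
```

ISO-8859-1 never raises because every byte maps to some character, so a "successful" decode with it does not prove the file was really written in that encoding.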

If all else fails, look into the codecs module here: http://docs.python.org/2/library/codecs.html

import codecs

with codecs.open('your_file.extension', 'r', 'utf8') as indexKey:
    for line in indexKey:
        pass  # each line is already decoded to unicode; your code here
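
For anyone who wants to try this without the original data, here is a minimal sketch of the same idea. It uses io.open, which also takes an encoding argument, instead of codecs.open. The file name and its contents are hypothetical, created only so the example is self-contained.

```python
import io
import os
import tempfile

# Hypothetical sample file, written here only so the sketch can run on its own.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(u"Stra\u00dfe\nJos\u00e9\n")

# Opening with an explicit encoding yields decoded text, not raw bytes,
# so no implicit ASCII decoding happens later in the program.
with io.open(path, "r", encoding="utf-8") as handle:
    lines = [line.rstrip(u"\n") for line in handle]
```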
1
votes

The presentation from Robᵩ (http://nedbatchelder.com/text/unipain.html) REALLY helped my understanding of unicode. I HIGHLY recommend it to anyone with unicode issues.

My takeaways:

  • Convert everything to unicode as you ingest it into your app.
  • Use only unicode strings in your code.
  • Specify the encoding as you output the data from your app.
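
The three points above can be sketched roughly as follows. This is only an illustration: flatten_record is a hypothetical helper, the "|" delimiter is an assumption, and only two of the field names from the question are used.

```python
import json


def flatten_record(raw_line, delim=u"|"):
    """Decode on input, work in unicode, encode on output."""
    # 1. Input boundary: turn the raw bytes into unicode text.
    text = raw_line.decode("utf-8")
    # 2. Inside the program: json.loads on unicode text returns unicode strings,
    #    and the delimiter is unicode too, so nothing is mixed with byte strings.
    record = json.loads(text)
    fields = [record["streetName"], record["CountryCode"]]
    out_text = delim.join(fields) + u"\n"
    # 3. Output boundary: choose the encoding explicitly.
    return out_text.encode("utf-8")
```

Note that there is no str() call on the field values. In Python 2, str() on a unicode value containing non-ASCII characters triggers an implicit ASCII conversion, which is where errors like the one in the question tend to come from.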

In my case, I was reading from stdin and a file and writing to stdout:

For stdin:

inData = codecs.getreader('utf-8')(sys.stdin)

For a file:

inData = codecs.open("myFile","r","utf-8")

For stdout (do this once before writing anything to stdout):

sys.stdout = codecs.getwriter('utf-8')(sys.stdout)
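
Here is a small sketch of how the reader and writer wrappers behave. To keep it runnable anywhere, io.BytesIO objects stand in for stdin and stdout (an assumption for the example). The wrappers expect byte-oriented streams, which is what sys.stdin and sys.stdout are in Python 2.

```python
import codecs
import io

# BytesIO objects stand in for byte-oriented stdin/stdout in this sketch.
fake_stdin = io.BytesIO(u"Jos\u00e9\n".encode("utf-8"))
fake_stdout = io.BytesIO()

reader = codecs.getreader("utf-8")(fake_stdin)    # bytes in, unicode out
writer = codecs.getwriter("utf-8")(fake_stdout)   # unicode in, bytes out

for line in reader:
    # Each line is unicode here, so text operations work per character.
    writer.write(line.upper())

written = fake_stdout.getvalue()
```

After the loop, written holds the UTF-8 bytes of the upper-cased line. None of the code in between has to mention an encoding.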