Python 2.7: Read file with Chinese characters

Question

I am trying to analyze data within CSV files with Chinese characters in their names (e.g. "粗1 25g"). I am using Tkinter to choose the files like so:

selectedFiles = askopenfilenames(filetypes=[("xlsx","*"),("xls","*")]) # Use the Tkinter dialog window to choose files
selectedFiles = master.tk.splitlist(selectedFiles) # Create list from files chosen

I have attempted to convert the filename to unicode in this way:

selectedFiles = [x.decode("utf-8") for x in selectedFiles]

Only to yield the error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xb4 in position 0: ordinal not in range(128)

I have also tried converting the filenames as the files are created with the following:

titles = [x.encode('utf-8') for x in titles]

Only to receive the error:

IOError: [Errno 22] invalid mode ('wb') or filename: 'C:\...\\data_division_files\\\xe7\xb2\x971 25g.csv'

I have also tried combinations of the above methods to no avail. What can I do to allow these files to be read in Python?

(This question, while related, has not solved my problem: Obtain File size with os.path.getsize() in Python 2.7.5)

You have to know which encoding is used for your filenames. Judging from your error message, it may be UTF-16. Try filename.decode("utf16"). — Giacomo d'Antonio
import codecs, and then use the appropriate codecs methods on the Chinese filename text. — Nilesh
@NileshG: codecs is for Unicode contents; it doesn't do any good for Unicode filenames. — abarnert
@abarnert: Filename is nothing but a text... So we can use codecs ... — Nilesh
I think the problem lies in the naming of the files in the first place. If I change the elements of the list of files I have selected, then those files don't exist. I must instead make sure the original filenames are readable by the askopenfilenames function and other functions I am performing on them. — salamander

abarnert · Accepted Answer · 2013-10-18T09:11:38

When you call decode on a unicode object, it first encodes it with sys.getdefaultencoding() so it can decode it for you. Which is why you get an error about ASCII even though you didn't ask for ASCII anywhere.
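
A rough sketch of that hidden step, written out by hand. This is a simulation, not Python 2's actual internals: decode_like_py2 is a hypothetical helper, and "ascii" is assumed as the default encoding (the usual value of sys.getdefaultencoding() on Python 2). It is spelled out this way because a text string on Python 3 has no decode method at all.

```python
def decode_like_py2(text, encoding):
    """Hypothetical helper: mimic calling .decode() on a Python 2 unicode object."""
    implicit_codec = "ascii"           # ASSUMPTION: Python 2's default encoding
    raw = text.encode(implicit_codec)  # the hidden step nobody asked for
    return raw.decode(encoding)        # the decode that was actually requested


name = u"\u7c971 25g"  # the example filename from the question

try:
    decode_like_py2(name, "utf-8")
    failed_codec = None
except UnicodeError as exc:
    # The codec named in the error is the implicit one, not "utf-8".
    failed_codec = exc.encoding

print(failed_codec)
```

A pure-ASCII name passes through this unchanged, which is why the problem only shows up once a non-ASCII filename is selected.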

So, where are you getting a unicode object from? From askopenfilename. From a quick test, it looks like it always returns unicode values on Windows (presumably by getting the UTF-16 and decoding it), while on POSIX it returns some unicode and some str (I'd guess by leaving alone anything that fits into 7-bit ASCII, decoding anything else with your filesystem encoding). If you'd tried printing out the repr or type or anything of selectedFiles, the problem would have been obvious.
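
Something like the following is what I mean by checking the type and repr. It is only a sketch: describe_paths is a made-up helper, and the sample list stands in for whatever the dialog actually returned.

```python
def describe_paths(paths):
    """Return (type name, repr) for each path so text and byte strings are easy to tell apart."""
    return [(type(p).__name__, repr(p)) for p in paths]


# Stand-in for the dialog result: one text path and one byte-string path.
sample = [u"\u7c971 25g.csv", b"plain.csv"]

for type_name, shown in describe_paths(sample):
    print("%s %s" % (type_name, shown))
```

On Python 2.7 the type names come out as unicode and str; on Python 3 the same values report as str and bytes.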

Meanwhile, the encode('utf-8') shouldn't cause any UnicodeErrors, but it's likely that your filesystem encoding isn't UTF-8 on Windows, so it will probably cause a lot of IOErrors with errno 2 (trying to open files that don't exist, or to create files in directories that don't exist), 22 (trying to open files with illegal file or directory names on Windows), etc. And it looks like that's exactly what you're seeing. And there's really no reason to do it; just pass the pathnames as-is to open and they'll be fine.
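
To make the mismatch concrete, here is a small sketch comparing the bytes each encoding produces for the same name. GBK is purely an assumption here, standing in for whatever code page the Windows machine actually uses; the question doesn't say.

```python
name = u"\u7c971 25g.csv"  # the example filename from the question, plus .csv

utf8_bytes = name.encode("utf-8")
gbk_bytes = name.encode("gbk")  # ASSUMPTION: stand-in for a Chinese Windows code page

print(repr(utf8_bytes))
print(repr(gbk_bytes))
print(utf8_bytes == gbk_bytes)
```

The UTF-8 bytes start with \xe7\xb2\x97, the same sequence that shows up in the path from the IOError in the question. A filesystem layer expecting a different encoding reads those bytes as a different (and possibly invalid) name.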

So, basically, if you removed all of your encode and decode calls, your code would probably just work.
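
A minimal sketch of the "leave the path alone" approach, assuming the filesystem can represent the name. roundtrip_unicode_filename is a hypothetical helper that writes to a throwaway temp directory rather than the folders from the question.

```python
import io
import os
import shutil
import tempfile


def roundtrip_unicode_filename(file_name, payload):
    """Write and re-read a file whose name is passed as text, with no encode()/decode() on the path."""
    tmp_dir = tempfile.mkdtemp()
    try:
        path = os.path.join(tmp_dir, file_name)  # the path stays a text string throughout
        with io.open(path, "w", encoding="utf-8") as handle:
            handle.write(payload)
        with io.open(path, "r", encoding="utf-8") as handle:
            return handle.read()
    finally:
        shutil.rmtree(tmp_dir)  # clean up the temp directory either way


result = roundtrip_unicode_filename(u"\u7c971 25g.csv", u"a,b\n1,2\n")
print(repr(result))
```

The encoding argument here applies only to the file's contents; the filename itself is handed to open untouched.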

However, there's an even easier solution: just use askopenfile or asksaveasfile instead of askopenfilename or asksaveasfilename (askopenfiles is the multi-file counterpart of the askopenfilenames call in the question). Let Tk figure out how to use its pathnames and just hand you the file objects, instead of messing with the pathnames yourself.
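
Roughly, that could look like the sketch below. read_selected_files is a hypothetical helper that takes the dialog function as an argument, so the logic can be tried without a display. The commented-out wiring assumes the standard askopenfiles dialog, which returns a list of already-open file objects.

```python
def read_selected_files(ask_files):
    """Read every file object returned by ask_files() and map its name to its contents.

    ask_files is any callable returning open file objects, for example
    a wrapper around the Tk askopenfiles dialog.
    """
    contents = {}
    for handle in ask_files():
        try:
            contents[handle.name] = handle.read()
        finally:
            handle.close()  # the dialog opened the file, so close it here
    return contents


# Typical wiring (needs a display, so it is left commented out):
# from tkFileDialog import askopenfiles          # Python 2.7
# from tkinter.filedialog import askopenfiles    # Python 3
# data = read_selected_files(lambda: askopenfiles(mode="r"))
```

With this shape the pathnames never pass through your own code as anything but dictionary keys, so there is nothing to encode or decode.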
