Preface: will your viewer work?
Make sure your viewer/editor/terminal (however you are interacting with your utf-8 encoded file) can read the file. This is frequently an issue on Windows, for example, Notepad.
Writing Unicode text to a text file?
In Python 2, use open
from the io
module (this is the same as the builtin open
in Python 3):
import io
Best practice, in general, use UTF-8
for writing to files (we don't even have to worry about byte-order with utf-8).
encoding = 'utf-8'
utf-8 is the most modern and universally usable encoding - it works in all web browsers, most text-editors (see your settings if you have issues) and most terminals/shells.
On Windows, you might try utf-16le
if you're limited to viewing output in Notepad (or another limited viewer).
encoding = 'utf-16le' # sorry, Windows users... :(
And just open it with the context manager and write your unicode characters out:
with io.open(filename, 'w', encoding=encoding) as f:
f.write(unicode_object)
Example using many Unicode characters
Here's an example that attempts to map every possible character up to three bits wide (4 is the max, but that would be going a bit far) from the digital representation (in integers) to an encoded printable output, along with its name, if possible (put this into a file called uni.py
):
from __future__ import print_function
import io
from unicodedata import name, category
from curses.ascii import controlnames
from collections import Counter
try: # use these if Python 2
unicode_chr, range = unichr, xrange
except NameError: # Python 3
unicode_chr = chr
exclude_categories = set(('Co', 'Cn'))
counts = Counter()
control_names = dict(enumerate(controlnames))
with io.open('unidata', 'w', encoding='utf-8') as f:
for x in range((2**8)**3):
try:
char = unicode_chr(x)
except ValueError:
continue # can't map to unicode, try next x
cat = category(char)
counts.update((cat,))
if cat in exclude_categories:
continue # get rid of noise & greatly shorten result file
try:
uname = name(char)
except ValueError: # probably control character, don't use actual
uname = control_names.get(x, '')
f.write(u'{0:>6x} {1} {2}\n'.format(x, cat, uname))
else:
f.write(u'{0:>6x} {1} {2} {3}\n'.format(x, cat, char, uname))
# may as well describe the types we logged.
for cat, count in counts.items():
print('{0} chars of category, {1}'.format(count, cat))
This should run in the order of about a minute, and you can view the data file, and if your file viewer can display unicode, you'll see it. Information about the categories can be found here. Based on the counts, we can probably improve our results by excluding the Cn and Co categories, which have no symbols associated with them.
$ python uni.py
It will display the hexadecimal mapping, category, symbol (unless can't get the name, so probably a control character), and the name of the symbol. e.g.
I recommend less
on Unix or Cygwin (don't print/cat the entire file to your output):
$ less unidata
e.g. will display similar to the following lines which I sampled from it using Python 2 (unicode 5.2):
0 Cc NUL
20 Zs SPACE
21 Po ! EXCLAMATION MARK
b6 So ¶ PILCROW SIGN
d0 Lu Ð LATIN CAPITAL LETTER ETH
e59 Nd ๙ THAI DIGIT NINE
2887 So ⢇ BRAILLE PATTERN DOTS-1238
bc13 Lo 밓 HANGUL SYLLABLE MIH
ffeb Sm → HALFWIDTH RIGHTWARDS ARROW
My Python 3.5 from Anaconda has unicode 8.0, I would presume most 3's would.