77
votes

I am writing a script that will try encoding bytes into many different encodings in Python 2.6. Is there some way to get a list of available encodings that I can iterate over?

The reason I'm trying to do this is because a user has some text that is not encoded correctly. There are funny characters. I know the unicode character that's messing it up. I want to be able to give them an answer like "Your text editor is interpreting that string as X encoding, not Y encoding". I thought I would try to encode that character using one encoding, then decode it again using another encoding, and see if we get the same character sequence.

i.e. something like this:

for encoding1, encoding2 in itertools.permutation(encodinglist(), 2):
  try:
    unicode_string = my_unicode_character.encode(encoding1).decode(encoding2)
  except:
    pass
9
Perhaps you should start a new question, giving details of what the actual problem is, including how you know what is the Unicode character that's messing it up, and what "messing it up" means, and what the "funny characters" are, etc etc. If the offending data is in a file, show the relevant part of the output of print repr(open('thefile.txt', 'rb').read())John Machin
I needed this functionality when cleaning non-UTF filenames from a large file share. There was no telling what the original encoding for many files was... Some of these embedded "odd" single bytes didn't fit any code points in Windows-1252 or ISO-8859, and a useful way of guessing what set they came from was to get Python to convert the single byte to each encoding it can, and see if the result was reasonable. Then fix the filename.Joe Koberg
For example b'Bj\x94rk' didn't fit ISO-8859-1 but after trying them all I see it fit CP850 or CP437.Joe Koberg
I implemented a script which uses ideas of Anurag Uniyal and u0b34a0f6ae to get list of available codecs. The script also tests codecs on all byte values and measures performance.Vasily E.

9 Answers

38
votes

Unfortunately encodings.aliases.aliases.keys() is NOT an appropriate answer.

aliases(as one would/should expect) contains several cases where different keys are mapped to the same value e.g. 1252 and windows_1252 are both mapped to cp1252. You could save time if instead of aliases.keys() you use set(aliases.values()).

BUT THERE'S A WORSE PROBLEM: aliases doesn't contain codecs that don't have aliases (like cp856, cp874, cp875, cp737, and koi8_u).

>>> from encodings.aliases import aliases
>>> def find(q):
...     return [(k,v) for k, v in aliases.items() if q in k or q in v]
...
>>> find('1252') # multiple aliases
[('1252', 'cp1252'), ('windows_1252', 'cp1252')]
>>> find('856') # no codepage 856 in aliases
[]
>>> find('koi8') # no koi8_u in aliases
[('cskoi8r', 'koi8_r')]
>>> 'x'.decode('cp856') # but cp856 is a valid codec
u'x'
>>> 'x'.decode('koi8_u') # but koi8_u is a valid codec
u'x'
>>>

It's also worth noting that however you obtain a full list of codecs, it may be a good idea to ignore the codecs that aren't about encoding/decoding character sets, but do some other transformation e.g. zlib, quopri, and base64.

Which brings us to the question of WHY you want to "try encoding bytes into many different encodings". If we know that, we may be able to steer you in the right direction.

For a start, that's ambiguous. One DEcodes bytes into unicode, and one ENcodes unicode into bytes. Which do you want to do?

What are you really trying to achieve: Are you trying to determine which codec to use to decode some incoming bytes, and plan to attempt this with all possible codecs? [note: latin1 will decode anything] Are you trying to determine the language of some unicode text by trying to encode it with all possible codecs? [note: utf8 will encode anything].

90
votes

Other answers here seem to indicate that constructing this list programmatically is difficult and fraught with traps. However, doing so is probably unnecessary since the documentation contains a complete list of the standard encodings Python supports, and has done since Python 2.3.

You can find these lists (for each stable version of the language so far released) at:

Below are the lists for each documented version of Python. Note that if you want backwards-compatibility rather than just supporting a particular version of Python, you can just copy the list from the latest Python version and check whether each encoding exists in the Python running your program before trying to use it.

Python 2.3 (59 encodings)

['ascii',
 'cp037',
 'cp424',
 'cp437',
 'cp500',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp869',
 'cp874',
 'cp875',
 'cp1006',
 'cp1026',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'koi8_r',
 'koi8_u',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8']

Python 2.4 (85 encodings)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp424',
 'cp437',
 'cp500',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'johab',
 'koi8_r',
 'koi8_u',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8']

Python 2.5 (86 encodings)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp424',
 'cp437',
 'cp500',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'johab',
 'koi8_r',
 'koi8_u',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8',
 'utf_8_sig']

Python 2.6 (90 encodings)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp424',
 'cp437',
 'cp500',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'iso8859_16',
 'johab',
 'koi8_r',
 'koi8_u',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8',
 'utf_8_sig']

Python 2.7 (93 encodings)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp424',
 'cp437',
 'cp500',
 'cp720',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp858',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_11',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'iso8859_16',
 'johab',
 'koi8_r',
 'koi8_u',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8',
 'utf_8_sig']

Python 3.0 (89 encodings)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp424',
 'cp437',
 'cp500',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'johab',
 'koi8_r',
 'koi8_u',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8',
 'utf_8_sig']

Python 3.1 (90 encodings)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp424',
 'cp437',
 'cp500',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'iso8859_16',
 'johab',
 'koi8_r',
 'koi8_u',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8',
 'utf_8_sig']

Python 3.2 (92 encodings)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp424',
 'cp437',
 'cp500',
 'cp720',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp858',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'iso8859_16',
 'johab',
 'koi8_r',
 'koi8_u',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8',
 'utf_8_sig']

Python 3.3 (93 encodings)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp424',
 'cp437',
 'cp500',
 'cp720',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp858',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'cp65001',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'iso8859_16',
 'johab',
 'koi8_r',
 'koi8_u',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8',
 'utf_8_sig']

Python 3.4 (96 encodings)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp273',
 'cp424',
 'cp437',
 'cp500',
 'cp720',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp858',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1125',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'cp65001',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_11',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'iso8859_16',
 'johab',
 'koi8_r',
 'koi8_u',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8',
 'utf_8_sig']

Python 3.5 (98 encodings)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp273',
 'cp424',
 'cp437',
 'cp500',
 'cp720',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp858',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1125',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'cp65001',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_11',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'iso8859_16',
 'johab',
 'koi8_r',
 'koi8_t',
 'koi8_u',
 'kz1048',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8',
 'utf_8_sig']

Python 3.6 (98 encodings)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp273',
 'cp424',
 'cp437',
 'cp500',
 'cp720',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp858',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1125',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'cp65001',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_11',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'iso8859_16',
 'johab',
 'koi8_r',
 'koi8_t',
 'koi8_u',
 'kz1048',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8',
 'utf_8_sig']

Python 3.7 (98 encodings)

['ascii',
 'big5',
 'big5hkscs',
 'cp037',
 'cp273',
 'cp424',
 'cp437',
 'cp500',
 'cp720',
 'cp737',
 'cp775',
 'cp850',
 'cp852',
 'cp855',
 'cp856',
 'cp857',
 'cp858',
 'cp860',
 'cp861',
 'cp862',
 'cp863',
 'cp864',
 'cp865',
 'cp866',
 'cp869',
 'cp874',
 'cp875',
 'cp932',
 'cp949',
 'cp950',
 'cp1006',
 'cp1026',
 'cp1125',
 'cp1140',
 'cp1250',
 'cp1251',
 'cp1252',
 'cp1253',
 'cp1254',
 'cp1255',
 'cp1256',
 'cp1257',
 'cp1258',
 'cp65001',
 'euc_jp',
 'euc_jis_2004',
 'euc_jisx0213',
 'euc_kr',
 'gb2312',
 'gbk',
 'gb18030',
 'hz',
 'iso2022_jp',
 'iso2022_jp_1',
 'iso2022_jp_2',
 'iso2022_jp_2004',
 'iso2022_jp_3',
 'iso2022_jp_ext',
 'iso2022_kr',
 'latin_1',
 'iso8859_2',
 'iso8859_3',
 'iso8859_4',
 'iso8859_5',
 'iso8859_6',
 'iso8859_7',
 'iso8859_8',
 'iso8859_9',
 'iso8859_10',
 'iso8859_11',
 'iso8859_13',
 'iso8859_14',
 'iso8859_15',
 'iso8859_16',
 'johab',
 'koi8_r',
 'koi8_t',
 'koi8_u',
 'kz1048',
 'mac_cyrillic',
 'mac_greek',
 'mac_iceland',
 'mac_latin2',
 'mac_roman',
 'mac_turkish',
 'ptcp154',
 'shift_jis',
 'shift_jis_2004',
 'shift_jisx0213',
 'utf_32',
 'utf_32_be',
 'utf_32_le',
 'utf_16',
 'utf_16_be',
 'utf_16_le',
 'utf_7',
 'utf_8',
 'utf_8_sig']

In case they're relevant to anyone's use case, note that the docs also list some Python-specific encodings, many of which seem to be primarily for use by Python's internals or are otherwise weird in some way, like the 'undefined' encoding which always throws an exception if you try to use it. You probably want to ignore these completely if, like the question-asker here, you're trying to figure out what encoding was used for some text you've come across in the real world. As of Python 3.7, the list is as follows:

["idna",
 "mbcs",
 "oem",
 "palmos",
 "punycode",
 "raw_unicode_escape",
 "rot_13",
 "undefined",
 "unicode_escape",
 "unicode_internal",
 "base64_codec",
 "bz2_codec",
 "hex_codec",
 "quopri_codec",
 "uu_codec",
 "zlib_codec"]

Some older Python versions had a string_escape special encoding that I've not included in the above list because it's been removed from the language.

Finally, in case you'd like to update my tables above for a newer version of Python, here's the (crude, not very robust) script I used to generate them:

import requests
import lxml.html
import pprint

for version, url in [
    ('2.3', 'https://docs.python.org/2.3/lib/node130.html'),
    ('2.4', 'https://docs.python.org/2.4/lib/standard-encodings.html'),
    ('2.5', 'https://docs.python.org/2.5/lib/standard-encodings.html'),
    ('2.6', 'https://docs.python.org/2.6/library/codecs.html#standard-encodings'),
    ('2.7', 'https://docs.python.org/2.7/library/codecs.html#standard-encodings'),
    ('3.0', 'https://docs.python.org/3.0/library/codecs.html#standard-encodings'),
    ('3.1', 'https://docs.python.org/3.1/library/codecs.html#standard-encodings'),
    ('3.2', 'https://docs.python.org/3.2/library/codecs.html#standard-encodings'),
    ('3.3', 'https://docs.python.org/3.3/library/codecs.html#standard-encodings'),
    ('3.4', 'https://docs.python.org/3.4/library/codecs.html#standard-encodings'),
    ('3.5', 'https://docs.python.org/3.5/library/codecs.html#standard-encodings'),
    ('3.6', 'https://docs.python.org/3.6/library/codecs.html#standard-encodings'),
    ('3.7', 'https://docs.python.org/3.7/library/codecs.html#standard-encodings'),
]:
    html = requests.get(url).text
    doc = lxml.html.fromstring(html)
    standard_encodings_table = doc.xpath(
        '//table[preceding::h2[.//text()[contains(., "Standard Encodings")]]][//th/text()="Codec"]'
    )[0]
    codecs = standard_encodings_table.xpath('.//td[1]/text()')
    print("## Python %s (%i encodings)" % (version, len(codecs)))
    print('<pre><code>' + pprint.pformat(codecs) + '</code></pre>')
23
votes

Maybe you should try using the Universal Encoding Detector (chardet) library instead of implementing it yourself.

>>> import chardet
>>> s = '\xe2\x98\x83' # ☃
>>> chardet.detect(s)
{'confidence': 0.505, 'encoding': 'utf-8'}
17
votes

You could use a technique to list all modules in the encodings package.

import pkgutil
import encodings

false_positives = set(["aliases"])

found = set(name for imp, name, ispkg in pkgutil.iter_modules(encodings.__path__) if not ispkg)
found.difference_update(false_positives)
print found
5
votes

I doubt there is such method/functionality in codecs module, but if you see encoding/__init__.py, search function searches thru encodings modules folder, so you may do the same e.g.

>>> os.listdir(os.path.dirname(encodings.__file__))
['cp500.pyc', 'utf_16_le.py', 'gb18030.py', 'mbcs.pyc', 'undefined.pyc', 'idna.pyc', 'punycode.pyc', 'cp850.py', 'big5hkscs.pyc', 'mac_arabic.py', '__init__.pyc', 'string_escape.py', 'hz.py', 'cp037.py', 'cp737.py', 'iso8859_5.pyc', 'iso8859_13.pyc', 'cp861.pyc', 'cp862.py', 'iso8859_9.pyc', 'cp949.py', 'base64_codec.pyc', 'koi8_r.py', 'iso8859_2.py', 'ptcp154.pyc', 'uu_codec.pyc', 'mac_croatian.pyc', 'charmap.pyc', 'iso8859_15.pyc', 'euc_jp.py', 'cp1250.py', 'iso8859_10.pyc', 'koi8_r.pyc', 'unicode_escape.pyc', 'cp863.pyc', 'iso8859_4.pyc', 'cp852.py', 'unicode_internal.py', 'big5hkscs.py', 'cp1257.pyc', 'cp1254.py', 'shift_jisx0213.py', 'shift_jis.pyc', 'cp869.pyc', 'hp_roman8.py', 'iso8859_4.py', 'cp775.py', 'cp1251.py', 'mac_cyrillic.pyc', 'mac_greek.pyc', 'mac_roman.pyc', 'iso8859_11.pyc', 'iso8859_6.py', 'utf_8_sig.py', 'iso8859_3.py', 'iso2022_jp_1.py', 'ascii.py', 'cp1026.pyc', 'cp1250.pyc', 'cp950.py', 'raw_unicode_escape.py', 'euc_jis_2004.pyc', 'cp775.pyc', 'euc_kr.py', 'mac
_greek.py', 'big5.pyc', 'shift_jis_2004.pyc', 'gbk.pyc', 'cp1254.pyc', 'cp1255.pyc', 'cp855.pyc', 'string_escape.pyc', 'cp949.pyc', 'cp1258.pyc', 'iso8859_3.pyc', 'mac_iceland.pyc', 'cp1251.pyc', 'cp860.py', 'cp856.py', 'cp874.py', 'iso2022_kr.py', 'cp856.pyc', 'rot_13.py', 'palmos.py', 'iso2022_jp_2.pyc', 'mac_farsi.py', 'koi8_u.pyc', 'cp1256.py', 'iso8859_10.py', 'tis_620.py', 'iso8859_14.pyc', 'cp1253.py', 'cp1258.py', 'cp437.py', 'cp862.pyc', 'mac_turkish.py', 'undefined.py', 'euc_kr.pyc', 'gb18030.pyc', 'aliases.pyc', 'iso8859_9.py', 'uu_codec.py', 'gbk.py', 'quopri_codec.pyc', 'iso8859_7.py', 'mac_iceland.py', 'iso8859_2.pyc', 'euc_jis_2004.py', 'iso2022_jp_3.pyc', 'cp874.pyc', '__init__.py', 'mac_roman.py', 'iso8859_16.py', 'cp866.py', 'unicode_internal.pyc', 'mac_turkish.pyc', 'johab.pyc', 'cp037.pyc', 'punycode.py', 'cp1253.pyc', 'euc_jisx0213.pyc', 'iso2022_jp_2004.pyc', 'iso2022_kr.pyc', 'zlib_codec.pyc', 'cp932.py', 'cp1255.py', 'iso2022_jp_1.pyc', 'cp857.pyc', 'cp424.pyc',
 'iso2022_jp_2.py', 'iso2022_jp.pyc', 'mbcs.py', 'utf_8.py', 'palmos.pyc', 'cp1252.pyc', 'aliases.py', 'quopri_codec.py', 'latin_1.pyc', 'iso2022_jp.py', 'zlib_codec.py', 'cp1026.py', 'cp860.pyc', 'cp1252.py', 'hex_codec.pyc', 'iso8859_1.pyc', 'cp850.pyc', 'cp861.py', 'iso8859_15.py', 'cp865.pyc', 'hp_roman8.pyc', 'iso8859_7.pyc', 'mac_latin2.py', 'iso8859_11.py', 'mac_centeuro.pyc', 'iso8859_6.pyc', 'ascii.pyc', 'mac_centeuro.py', 'iso2022_jp_3.py', 'bz2_codec.py', 'mac_arabic.pyc', 'euc_jisx0213.py', 'tis_620.pyc', 'shift_jis_2004.py', 'utf_8.pyc', 'cp855.py', 'mac_romanian.pyc', 'iso8859_8.py', 'cp869.py', 'ptcp154.py', 'utf_16_be.py', 'iso2022_jp_ext.pyc', 'bz2_codec.pyc', 'base64_codec.py', 'latin_1.py', 'charmap.py', 'hz.pyc', 'cp950.pyc', 'cp875.pyc', 'cp1006.pyc', 'utf_16.py', 'shift_jisx0213.pyc', 'cp424.py', 'cp932.pyc', 'iso8859_5.py', 'mac_romanian.py', 'utf_8_sig.pyc', 'iso8859_1.py', 'cp875.py', 'cp437.pyc', 'cp865.py', 'utf_7.py', 'utf_16_be.pyc', 'rot_13.pyc', 'euc_jp.p
yc', 'raw_unicode_escape.pyc', 'iso8859_8.pyc', 'utf_16.pyc', 'iso8859_14.py', 'iso8859_16.pyc', 'cp852.pyc', 'cp737.pyc', 'mac_croatian.py', 'mac_latin2.pyc', 'iso2022_jp_ext.py', 'cp1140.py', 'mac_cyrillic.py', 'cp1257.py', 'cp500.py', 'cp1140.pyc', 'shift_jis.py', 'unicode_escape.py', 'cp864.py', 'cp864.pyc', 'cp857.py', 'hex_codec.py', 'mac_farsi.pyc', 'idna.py', 'johab.py', 'utf_7.pyc', 'cp863.py', 'iso8859_13.py', 'koi8_u.py', 'gb2312.pyc', 'cp1256.pyc', 'cp866.pyc', 'iso2022_jp_2004.py', 'utf_16_le.pyc', 'gb2312.py', 'cp1006.py', 'big5.py']

but as anybody can register a codec, so that won't be exhaustive list.

3
votes

The Python source code has a script at Tools/unicode/listcodecs.py which lists all codecs.

Among the listed codecs, however, there are some that are not Unicode-to-byte converters, like base64_codec, quopri_codec and bz2_codec, as @John Machin pointed out.

2
votes

Here's a programmatic way to list all the encodings defined in the stdlib encodings package, note that this won't list user defined encodings. This combines some of the tricks in the other answers but actually produces a working list using the codec's canonical name.

import encodings
import pkgutil
import pprint


all_encodings = set()

for _, modname, _ in pkgutil.iter_modules(
        encodings.__path__, encodings.__name__ + '.',
):
    try:
        mod = __import__(modname, fromlist=[str('__trash')])
    except (ImportError, LookupError):
        # A few encodings are platform specific: mcbs, cp65001
        # print('skip {}'.format(modname))
        pass

    try:
        all_encodings.add(mod.getregentry().name)
    except AttributeError as e:
        # the `aliases` module doensn't actually provide a codec
        # print('skip {}'.format(modname))
        if 'regentry' not in str(e):
            raise

pprint.pprint(sorted(all_encodings))
2
votes

From Python 3.7.6 Source, under /Tools/unicode/listcodecs.py:

""" List all available codec modules.

(c) Copyright 2005, Marc-Andre Lemburg ([email protected]).

    Licensed to PSF under a Contributor Agreement.

"""

import os, codecs, encodings

_debug = 0

def listcodecs(dir):
    names = []
    for filename in os.listdir(dir):
        if filename[-3:] != '.py':
            continue
        name = filename[:-3]
        # Check whether we've found a true codec
        try:
            codecs.lookup(name)
        except LookupError:
            # Codec not found
            continue
        except Exception as reason:
            # Probably an error from importing the codec; still it's
            # a valid code name
            if _debug:
                print('* problem importing codec %r: %s' % \
                      (name, reason))
        names.append(name)
    return names


if __name__ == '__main__':
    names = listcodecs(encodings.__path__[0])
    names.sort()
    print('all_codecs = [')
    for name in names:
        print('    %r,' % name)
    print(']')

Then:

if str(response.encoding) is "undefined" or \
        str(response.encoding) not in names:
    do_something()  # like set default to utf_8 and execute
    pass
1
votes

Probably you can do this:

from encodings.aliases import aliases
print aliases.keys()