1
votes

Python 2.x doc says,

Unicode string is a sequence of code points

Unicode strings are expressed as instances of the unicode type

>>> ThisisNotUnicodeString = 'a정정💛' # What is the memory representation?
>>> ThisisNotUnicodeString
'a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b'
>>> type(ThisisNotUnicodeString)
<type 'str'>
>>> a = u'a정정💛' # Which encoding technique used to represent in memory? utf-8?
>>> a
u'a\uc815\uc815\U0001f49b'
>>> type(a)
<type 'unicode'>
>>> b = unicode('a정정💛', 'utf-8')
>>> b
u'a\uc815\uc815\U0001f49b'
>>> c = unicode('a정정💛', 'utf-16')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x9b in position 10: truncated data
>>> 

Question:

1) ThisisNotUnicodeString is a string literal. Although ThisisNotUnicodeString is not a unicode literal, which encoding is used to represent it in memory? There must be some encoding in play to represent the 정 or 💛 character in memory.

2) Which encoding is used to represent the unicode literal a in memory? UTF-8? If so, how can I find the number of bytes it occupies?

3) Why can't c be built from the same bytes using UTF-16?

2
What do you mean by "memory representation"? – Maroun
This might be saner not typed into some console but in a source file with a specified encoding which you then use. – pvg
a = u'a정정💛' is decoded from whatever the terminal encoding is; see sys.stdin.encoding. We know the terminal encoding is UTF-8 because b = unicode('a정정💛', 'utf-8') subsequently succeeds. c = unicode('a정정💛', 'utf-16') thus fails for the obvious reason that a UTF-8 byte string can't be decoded as UTF-16; the two encodings are nothing alike. – Eryk Sun
The internal format for unicode depends on the build. Python 2 on Windows and some Unix systems uses a narrow build that's internally something like UTF-16, but broken for non-BMP strings because it counts a surrogate pair as two characters in the string length. Most Unix systems use a wide build, which stores each Unicode ordinal as a 4-byte integer. – Eryk Sun
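The build width mentioned in the comment can be inspected via sys.maxunicode. A minimal sketch (written for any Python version; the Python 2 narrow-build value is an assumption based on the comment above):

```python
import sys

# On a Python 2 narrow build sys.maxunicode is 0xFFFF; on a wide build,
# and on every Python 3.3+ (which uses a flexible internal representation),
# it is 0x10FFFF, the highest valid Unicode code point.
print(hex(sys.maxunicode))
```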
@eryksun it's never UTF-16. UCS-2 or UCS-4. – pvg

2 Answers

1
votes

Which encoding technique used to represent in memory? utf-8?

You can try the following:

ThisisNotUnicodeString.decode('utf-8')

If it decodes without an error, the byte string is valid UTF-8; otherwise it's not.
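A minimal sketch of that check, written for Python 3 (where the equivalent of a Python 2 str is bytes):

```python
# The same byte string from the question, as a Python 3 bytes literal.
raw = b'a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b'

try:
    text = raw.decode('utf-8')
    print('valid UTF-8:', text)   # prints the original 'a정정💛'
except UnicodeDecodeError:
    print('not UTF-8')
```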

If you want the UTF-16 representation of the string, you should first decode it, and then encode it with the UTF-16 scheme:

ThisisNotUnicodeString.decode('utf-8').encode('utf-16')

So basically, you can decode and encode the given string from/to UTF-8/UTF-16, because every Unicode character can be represented in both schemes.

ThisisNotUnicodeString.decode('utf-8').encode('utf-16').decode('utf-16').encode('utf-8')
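The same round trip in Python 3 terms (bytes playing the role of Python 2's str) is lossless:

```python
raw = b'a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b'  # UTF-8 bytes

# decode('utf-8') yields a sequence of code points; encode('utf-16')
# yields UTF-16 bytes (with a BOM); decoding and re-encoding recovers
# the original UTF-8 bytes exactly.
round_trip = raw.decode('utf-8').encode('utf-16').decode('utf-16').encode('utf-8')
assert round_trip == raw
```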
2
votes

1) ThisisNotUnicodeString is a string literal. Although ThisisNotUnicodeString is not a unicode literal, which encoding is used to represent it in memory? There must be some encoding in play to represent the 정 or 💛 character in memory.

At the interactive prompt, the encoding used for Python 2.x's str type depends on your terminal's encoding. For example, if you run a terminal under Linux with its encoding set to UTF-8:

>>> s = "a정정💛"
>>> s
'a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b' 

Now try changing your terminal's encoding to something else; here I've changed it from UTF-8 to WINDOWS-1250:

>>> s = "a???"

If you try this in a tty session you get diamonds instead of ?; at least under Ubuntu, you may get different characters.

As you can conclude, the encoding used for str at the interactive prompt is shell-dependent. This only applies to code run interactively under the Python interpreter; the same code in a script will raise an exception:

#main.py
s = "a정정💛"

Trying to run the code raises a SyntaxError:

$ python main.py
SyntaxError: Non-ASCII character '\xec' in file main.py...

This is because Python 2.X uses ASCII as its default encoding:

>>> sys.getdefaultencoding()
'ascii'
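For comparison (not part of the original answer), Python 3 changed this default to UTF-8, which is why the same script runs there without a coding declaration:

```python
import sys

# On Python 3 this reports 'utf-8' rather than Python 2's 'ascii'.
print(sys.getdefaultencoding())
```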

Then you have to specify the encoding explicitly in your code:

#main.py
# -*- coding: utf-8 -*-
s = "a정정💛"

2) Which encoding technique used to represent unicode literal a in memory? utf-8? If yes, How to know the number of bytes occupied?

Keep in mind that the encoding scheme can differ when you run your code in different shells. I tested this under Linux; it could be slightly different on Windows, so check your operating system's documentation.

To know the number of bytes occupied, use len on the byte string (a Python 2 str is a sequence of bytes):

>>> s = "a정정💛"
>>> len(s)
11

s occupies exactly 11 bytes: 1 for a, 3 for each 정, and 4 for 💛.
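In Python 3 the distinction is explicit: len of a str counts code points, while len of its UTF-8 encoding counts bytes. A sketch:

```python
s = 'a\uc815\uc815\U0001f49b'  # 'a정정💛'

print(len(s))                  # 4 code points
print(len(s.encode('utf-8')))  # 11 bytes: 1 + 3 + 3 + 4
```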

2) Which encoding is used to represent the unicode literal a in memory? UTF-8? If so, how can I find the number of bytes it occupies?

Well, there's a confusion here: the unicode type does not have an encoding. It's just a sequence of Unicode code points (e.g. U+0040 for COMMERCIAL AT).
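That can be made concrete in Python 3, where str is the same kind of code-point sequence:

```python
import unicodedata

s = 'a\uc815\uc815\U0001f49b'  # 'a정정💛'

# Each element is a code point; no bytes or encoding are involved
# until you call .encode().
print([hex(ord(ch)) for ch in s])   # ['0x61', '0xc815', '0xc815', '0x1f49b']
print(unicodedata.name('@'))        # COMMERCIAL AT
```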

3) Why can't c be built from the same bytes using UTF-16?

UTF-8 and UTF-16 are different encoding schemes; they map characters to bytes differently. Here:

>>> c = unicode('a정정💛', 'utf-16')

You're essentially doing this:

>>> "a정정💛"
'a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b'
>>> unicode('a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b', 'utf-16')
UnicodeDecodeError: 'utf16' codec can't decode byte 0x9b in position 10: truncated data

This is because you're trying to decode UTF-8 bytes as UTF-16. Again, the two use different byte layouts to represent characters; they're simply two different ways to represent characters as bytes.
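A Python 3 sketch of the same failure: the string encodes to 11 UTF-8 bytes, which is not even an even number, so the UTF-16 codec hits a lone trailing byte and reports truncated data:

```python
raw = 'a\uc815\uc815\U0001f49b'.encode('utf-8')  # 11 bytes of UTF-8

try:
    raw.decode('utf-16')
except UnicodeDecodeError as exc:
    # Same error as in the question: truncated data at position 10.
    print(exc)
```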

For your reference: Python str vs unicode types