1) ThisisNotUnicodeString is a string literal. Even though ThisisNotUnicodeString
is not a unicode literal, which encoding technique is used to represent ThisisNotUnicodeString
in memory? There has to be some encoding technique to represent the 정 or 💛 character in memory.
In the interactive prompt, the encoding used to encode Python 2.X's str
type depends on your shell's encoding. For example, if you run a terminal under a Linux system with the terminal's encoding set to UTF-8:
>>> s = "a정정💛"
>>> s
'a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b'
Now change the encoding of your terminal window to something else; in this case I've changed the shell's encoding from UTF-8 to WINDOWS-1250:
>>> s = "a???"
If you try this in a tty session you get diamonds instead of ?; at least that's what happens under Ubuntu, and you may see different characters elsewhere.
As you can conclude, the encoding used for str
in the interactive prompt is shell-dependent. This applies only to code run interactively under the Python interpreter; code that's not run interactively will raise an exception:
#main.py
s = "a정정💛"
Trying to run the code raises a SyntaxError:
$ python main.py
SyntaxError: Non-ASCII character '\xec' in file main.py...
This is because Python 2.X uses ASCII by default:
>>> sys.getdefaultencoding()
'ascii'
Then, you have to specify the encoding explicitly in your code by doing this:
#main.py
# -*- coding: utf-8 -*-
s = "a정정💛"
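For contrast, here's a minimal Python 3 sketch (an assumption on my part, since the question is about Python 2): Python 3 source files default to UTF-8, so no coding declaration is needed, and the default encoding is UTF-8 rather than ASCII.

```python
# Python 3: source files are UTF-8 by default, no declaration required.
import sys

print(sys.getdefaultencoding())  # 'utf-8', not 'ascii' as in Python 2
s = "a정정💛"  # valid without any encoding declaration
```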
Keep in mind that the encoding scheme can differ if you run your code in different shells, I have tested this under Linux, this could be slightly different for Windows, so check your operating system's documentation.
To know the number of bytes occupied, use len:
>>> s = "a정정💛"
>>> len(s)
11
s occupies exactly 11 bytes.
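The byte count breaks down as 1 byte for a, 3 bytes for each 정, and 4 bytes for 💛. A Python 3 sketch makes the character-count vs. byte-count distinction explicit (in Python 3, len on a str counts code points, and you must encode to count bytes):

```python
# Python 3: str counts code points; bytes must be obtained via encode().
s = "a정정💛"
print(len(s))                  # 4 code points: a, 정, 정, 💛
print(len(s.encode("utf-8")))  # 11 bytes: 1 + 3 + 3 + 4
```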
2) Which encoding technique used to represent unicode literal a in memory? utf-8? If yes, How to know the number of bytes occupied?
Well, that's a common confusion: the unicode
type does not have an encoding. It's just a sequence of Unicode code points (e.g. U+0040 for Commercial At).
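You can see the code points directly. A Python 3 sketch (where every str is a sequence of code points, like Python 2's unicode):

```python
# Each character is a code point, independent of any byte encoding.
u = "a정"
print([hex(ord(ch)) for ch in u])  # ['0x61', '0xc815'] -> U+0061, U+C815
```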
3) Why is c not represented in memory using the utf-16 technique?
UTF-8 is an encoding scheme that's different from UTF-16; UTF-8 maps characters to bytes differently than UTF-16 does. Here:
>>> c = unicode('a정정💛', 'utf-16')
You're essentially doing this:
>>> "a정정💛"
'a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b'
>>> unicode('a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b', 'utf-16')
UnicodeDecodeError: 'utf16' codec can't decode byte 0x9b in position 10: truncated data
This is because you're trying to decode UTF-8 with UTF-16. Again, the two produce different byte sequences for the same characters; they're just two different encoding schemes, two different ways to represent characters as bytes.
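The same mismatch is easy to reproduce in Python 3 (a sketch, not the original Python 2 session): encode with one scheme, then try to decode with another.

```python
# Encoding with UTF-8 and decoding with UTF-16 fails: the byte
# sequences are incompatible (11 UTF-8 bytes isn't even an even
# number, so UTF-16 reports truncated data).
data = "a정정💛".encode("utf-8")
try:
    data.decode("utf-16")
except UnicodeDecodeError as exc:
    print("decode failed:", exc)

print(data.decode("utf-8"))  # decoding with the matching scheme succeeds
```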
For your reference:

Python str vs unicode types

a = u'a정정💛' is decoded from whatever the terminal encoding is. See sys.stdin.encoding. We know the terminal encoding is UTF-8 because subsequently b = unicode('a정정💛', 'utf-8') succeeds. c = unicode('a정정💛', 'utf-16') thus fails for the obvious reason that a UTF-8 byte string can't be decoded as UTF-16. The two encodings are nothing alike. – Eryk Sun

unicode depends on the build. Python 2 on Windows and some Unix systems uses a narrow build that's internally something like UTF-16, but broken for non-BMP strings because it counts a surrogate pair as two characters in the string length. Most Unix systems use a wide build, which stores each Unicode ordinal as a 4-byte integer. – Eryk Sun
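The narrow-build point can be illustrated in Python 3 (a sketch; Python 3.3+ no longer has narrow/wide builds): a non-BMP character such as 💛 is one code point, but needs a surrogate pair, i.e. two 16-bit code units, in UTF-16, which is why a narrow Python 2 build reported its length as 2.

```python
# 💛 (U+1F49B) is outside the Basic Multilingual Plane.
heart = "\U0001F49B"
print(len(heart))                           # 1 code point in Python 3
print(len(heart.encode("utf-16-le")) // 2)  # 2 UTF-16 code units (a surrogate pair)
```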