1
votes

Python 2.x doc says,

Unicode string is a sequence of code points

Unicode strings are expressed as instances of the unicode type

>>> ThisisNotUnicodeString = 'a정정💛' # What is the memory representation?
>>> ThisisNotUnicodeString
'a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b'
>>> type(ThisisNotUnicodeString)
<type 'str'>
>>> a = u'a정정💛' # Which encoding technique used to represent in memory? utf-8?
>>> a
u'a\uc815\uc815\U0001f49b'
>>> type(a)
<type 'unicode'>
>>> b = unicode('a정정💛', 'utf-8')
>>> b
u'a\uc815\uc815\U0001f49b'
>>> c = unicode('a정정💛', 'utf-16')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/encodings/utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x9b in position 10: truncated data
>>> 

Question:

1) ThisisNotUnicodeString is a string literal. Although ThisisNotUnicodeString is not a unicode literal, which encoding is used to represent it in memory? There must be some encoding in play to represent the 정 or 💛 character in memory.

2) Which encoding is used to represent the unicode literal a in memory? UTF-8? If so, how can I find the number of bytes it occupies?

3) Why can't c be built from the same bytes using UTF-16?

2
What do you mean by "memory representation"? – Maroun
This might be saner not typed into some console but in a source file with a specified encoding which you then use. – pvg
a = u'a정정💛' is decoded from whatever the terminal encoding is; see sys.stdin.encoding. We know the terminal encoding is UTF-8 because b = unicode('a정정💛', 'utf-8') subsequently succeeds. c = unicode('a정정💛', 'utf-16') thus fails for the obvious reason that a UTF-8 byte string can't be decoded as UTF-16; the two encodings are nothing alike. – Eryk Sun
The internal format for unicode depends on the build. Python 2 on Windows and some Unix systems uses a narrow build that's internally something like UTF-16, but broken for non-BMP strings because it counts a surrogate pair as two characters in the string length. Most Unix systems use a wide build, which stores each Unicode ordinal as a 4-byte integer. – Eryk Sun
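The build width mentioned in the comment can be inspected via sys.maxunicode. A minimal sketch (written for any Python version; the Python 2 narrow-build value is an assumption based on the comment above):

```python
import sys

# On a Python 2 narrow build sys.maxunicode is 0xFFFF; on a wide build,
# and on every Python 3.3+ (which uses a flexible internal representation),
# it is 0x10FFFF, the highest valid Unicode code point.
print(hex(sys.maxunicode))
```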
@eryksun it's never UTF-16. UCS-2 or UCS-4. – pvg

2 Answers

1
votes

Which encoding technique used to represent in memory? utf-8?

You can try the following:

ThisisNotUnicodeString.decode('utf-8')

If it decodes without an error, the byte string is valid UTF-8; otherwise it's not.
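A minimal sketch of that check, written for Python 3 (where the equivalent of a Python 2 str is bytes):

```python
# The same byte string from the question, as a Python 3 bytes literal.
raw = b'a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b'

try:
    text = raw.decode('utf-8')
    print('valid UTF-8:', text)   # prints the original 'a정정💛'
except UnicodeDecodeError:
    print('not UTF-8')
```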

If you want the UTF-16 representation of the string, you should first decode it, and then encode it with the UTF-16 scheme:

ThisisNotUnicodeString.decode('utf-8').encode('utf-16')

So basically, you can decode and encode the given string from/to UTF-8/UTF-16, because every Unicode character can be represented in both schemes.

ThisisNotUnicodeString.decode('utf-8').encode('utf-16').decode('utf-16').encode('utf-8')
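The same round trip in Python 3 terms (bytes playing the role of Python 2's str) is lossless:

```python
raw = b'a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b'  # UTF-8 bytes

# decode('utf-8') yields a sequence of code points; encode('utf-16')
# yields UTF-16 bytes (with a BOM); decoding and re-encoding recovers
# the original UTF-8 bytes exactly.
round_trip = raw.decode('utf-8').encode('utf-16').decode('utf-16').encode('utf-8')
assert round_trip == raw
```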
2
votes

1) ThisisNotUnicodeString is a string literal. Although ThisisNotUnicodeString is not a unicode literal, which encoding is used to represent it in memory? There must be some encoding in play to represent the 정 or 💛 character in memory.

At the interactive prompt, the encoding used for Python 2.x's str type depends on your terminal's encoding. For example, if you run a terminal under Linux with its encoding set to UTF-8:

>>> s = "a정정💛"
>>> s
'a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b' 

Now try changing your terminal's encoding to something else; here I've changed it from UTF-8 to WINDOWS-1250:

>>> s = "a???"

If you try this in a tty session you get diamonds instead of ?; at least under Ubuntu, you may get different characters.

As you can conclude, the encoding used for str at the interactive prompt is shell-dependent. This only applies to code run interactively under the Python interpreter; the same code in a script will raise an exception:

#main.py
s = "a정정💛"

Trying to run the code raises a SyntaxError:

$ python main.py
SyntaxError: Non-ASCII character '\xec' in file main.py...

This is because Python 2.X uses ASCII as its default encoding:

>>> sys.getdefaultencoding()
'ascii'
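For comparison (not part of the original answer), Python 3 changed this default to UTF-8, which is why the same script runs there without a coding declaration:

```python
import sys

# On Python 3 this reports 'utf-8' rather than Python 2's 'ascii'.
print(sys.getdefaultencoding())
```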

Then you have to specify the encoding explicitly in your code:

#main.py
# -*- coding: utf-8 -*-
s = "a정정💛"

2) Which encoding technique used to represent unicode literal a in memory? utf-8? If yes, How to know the number of bytes occupied?

Keep in mind that the encoding scheme can differ when you run your code in different shells. I tested this under Linux; it could be slightly different on Windows, so check your operating system's documentation.

To know the number of bytes occupied, use len on the byte string (a Python 2 str is a sequence of bytes):

>>> s = "a정정💛"
>>> len(s)
11

s occupies exactly 11 bytes: 1 for a, 3 for each 정, and 4 for 💛.
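In Python 3 the distinction is explicit: len of a str counts code points, while len of its UTF-8 encoding counts bytes. A sketch:

```python
s = 'a\uc815\uc815\U0001f49b'  # 'a정정💛'

print(len(s))                  # 4 code points
print(len(s.encode('utf-8')))  # 11 bytes: 1 + 3 + 3 + 4
```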

2) Which encoding is used to represent the unicode literal a in memory? UTF-8? If so, how can I find the number of bytes it occupies?

Well, there's a confusion here: the unicode type does not have an encoding. It's just a sequence of Unicode code points (e.g. U+0040 for COMMERCIAL AT).
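That can be made concrete in Python 3, where str is the same kind of code-point sequence:

```python
import unicodedata

s = 'a\uc815\uc815\U0001f49b'  # 'a정정💛'

# Each element is a code point; no bytes or encoding are involved
# until you call .encode().
print([hex(ord(ch)) for ch in s])   # ['0x61', '0xc815', '0xc815', '0x1f49b']
print(unicodedata.name('@'))        # COMMERCIAL AT
```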

3) Why can't c be built from the same bytes using UTF-16?

UTF-8 and UTF-16 are different encoding schemes; they map characters to bytes differently. Here:

>>> c = unicode('a정정💛', 'utf-16')

You're essentially doing this:

>>> "a정정💛"
'a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b'
>>> unicode('a\xec\xa0\x95\xec\xa0\x95\xf0\x9f\x92\x9b', 'utf-16')
UnicodeDecodeError: 'utf16' codec can't decode byte 0x9b in position 10: truncated data

This is because you're trying to decode UTF-8 bytes as UTF-16. Again, the two use different byte layouts to represent characters; they're simply two different ways to represent characters as bytes.
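A Python 3 sketch of the same failure: the string encodes to 11 UTF-8 bytes, which is not even an even number, so the UTF-16 codec hits a lone trailing byte and reports truncated data:

```python
raw = 'a\uc815\uc815\U0001f49b'.encode('utf-8')  # 11 bytes of UTF-8

try:
    raw.decode('utf-16')
except UnicodeDecodeError as exc:
    # Same error as in the question: truncated data at position 10.
    print(exc)
```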

For your reference: Python str vs unicode types