Let's say I have a string in Python:
>>> s = 'python'
>>> len(s)
6
Now I encode this string like this:
>>> b = s.encode('utf-8')
>>> b16 = s.encode('utf-16')
>>> b32 = s.encode('utf-32')
What I get from the above operations is a bytes array; that is, b, b16 and b32 are just arrays of bytes (each byte being 8 bits long, of course).
But we encoded the string. So, what does this mean? How do we attach the notion of "encoding" to the raw array of bytes?
The answer lies in the fact that each of these arrays of bytes is generated in a particular way. Let's look at these arrays:
>>> [hex(x) for x in b]
['0x70', '0x79', '0x74', '0x68', '0x6f', '0x6e']
>>> len(b)
6
This array indicates that for each character we have one byte (because all the code points are at most 127). Hence, we can say that "encoding" the string to 'utf-8' collects each character's corresponding code point and puts it into the array. If the code point cannot fit in one byte, then utf-8 consumes two bytes. Hence utf-8 consumes the least number of bytes possible.
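To probe how this extends beyond ASCII, here is a small sketch that encodes a few single characters and counts the resulting bytes. The sample characters are assumptions picked for illustration (one from each code point range); they do not come from the session above.

```python
# Sample characters chosen for illustration (an assumption, not from the
# original session): ASCII, Latin-1 range, BMP, and beyond the BMP.
samples = ['p', '\u00e9', '\u20ac', '\U0001F600']

# Map each character to the number of bytes its UTF-8 encoding occupies.
utf8_lengths = {ch: len(ch.encode('utf-8')) for ch in samples}

for ch, n in utf8_lengths.items():
    # Print the code point rather than the character itself, so the
    # output does not depend on the console's encoding.
    print(f'U+{ord(ch):04X} -> {n} byte(s)')
```

On Python 3 this reports 1, 2, 3 and 4 bytes respectively, which suggests the per-character size in utf-8 is not limited to one or two bytes.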
>>> [hex(x) for x in b16]
['0xff', '0xfe', '0x70', '0x0', '0x79', '0x0', '0x74', '0x0', '0x68', '0x0', '0x6f', '0x0', '0x6e', '0x0']
>>> len(b16)
14 # (2 + 6*2)
Here we can see that "encoding to utf-16" first puts a two-byte BOM (FF FE) into the bytes array, and after that, for each character it puts two bytes into the array. (In our case, the second byte is always zero.)
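A sketch to separate the BOM from the per-character bytes: the codec names 'utf-16-le' and 'utf-16-be' fix the byte order explicitly, and in that case Python writes no BOM. The emoji at the end is a hypothetical test input, added to check whether "two bytes per character" always holds.

```python
import codecs

s = 'python'

# Explicit byte order: no BOM is written, only the per-character bytes.
le = s.encode('utf-16-le')
be = s.encode('utf-16-be')
print(len(le), len(be))  # 12 12

# The generic 'utf-16' codec uses the platform's byte order, so its first
# two bytes match one of the two BOM constants from the codecs module.
bom = s.encode('utf-16')[:2]
print(bom in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True

# A character outside the Basic Multilingual Plane (hypothetical input).
astral = '\U0001F600'.encode('utf-16-le')
print(len(astral))  # 4
```

The last line suggests that utf-16 is also variable-length: some characters take four bytes (a surrogate pair) rather than two.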
>>> [hex(x) for x in b32]
['0xff', '0xfe', '0x0', '0x0', '0x70', '0x0', '0x0', '0x0', '0x79', '0x0', '0x0', '0x0', '0x74', '0x0', '0x0', '0x0', '0x68', '0x0', '0x0', '0x0', '0x6f', '0x0', '0x0', '0x0', '0x6e', '0x0', '0x0', '0x0']
>>> len(b32)
28 # (2 + 6*4 + 2)
In the case of "encoding in utf-32", we first put the BOM, then for each character we put four bytes, and lastly we put two zero bytes into the array.
Hence, we can say that the "encoding process" collects 1, 2 or 4 bytes (depending on the encoding name) for each character in the string, and prepends and appends more bytes to them to create the final result array of bytes.
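One way to test this reading of the utf-32 output is to compare it against the BOM constants exposed by the standard codecs module, and against the BOM-free 'utf-32-le' codec. This is only a sketch; the variable names payload and extra are introduced here for illustration.

```python
import codecs

s = 'python'
b32 = s.encode('utf-32')

# Inspect the little-endian UTF-32 BOM constant and its length.
print(len(codecs.BOM_UTF32_LE), [hex(x) for x in codecs.BOM_UTF32_LE])

# Encoding with an explicit byte order emits no BOM, only the payload.
payload = s.encode('utf-32-le')
print(len(payload))  # 24

# Number of bytes the generic codec adds on top of the payload.
extra = len(b32) - len(payload)
print(extra)  # 4

# Last four bytes of the output, for comparison with the final character.
print([hex(x) for x in b32[-4:]])
```

If the BOM constant is four bytes long, the two zero bytes after FF FE would belong to the BOM, and the zeros at the end of the listing would belong to the last character rather than being appended separately.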
Now, my questions:
- Is my understanding of the encoding process correct or am I missing something?
- We can see that the memory representation of the variables b, b16 and b32 is actually a list of bytes. What is the memory representation of the string? Exactly what is stored in memory for a string?
- We know that when we do an encode(), each character's corresponding code point is collected (code point corresponding to the encoding name) and put into an array of bytes. What exactly happens when we do a decode()?
- We can see that in utf-16 and utf-32, a BOM is prepended, but why are two zero bytes appended in the utf-32 encoding?
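For the decode() question, a minimal round-trip sketch: decoding each byte array with the codec that produced it should give back the original string, while decoding with a different codec reinterprets the same bytes. Treat this as an experiment on observable behavior, not as a description of how the interpreter stores strings internally.

```python
s = 'python'

# Encode with each codec, then decode with the same codec.
round_trips = {
    name: s.encode(name).decode(name)
    for name in ('utf-8', 'utf-16', 'utf-32')
}
print(all(result == s for result in round_trips.values()))  # True

# Decoding with a mismatched codec: the utf-16 zero bytes are valid
# utf-8 on their own, so each one becomes a separate NUL character.
mismatched = s.encode('utf-16-le').decode('utf-8')
print(repr(mismatched), len(mismatched))
```

The mismatched case yields a 12-character string with a NUL after every letter, which suggests the bytes themselves carry no record of the encoding used to produce them.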
0x0 0x0 0xfe 0xff is the BOM for big-endian, as opposed to little-endian, utf-32). – Katriel