The first string does not have an encoding. It is raw bytes. A convincing way to prove this to yourself is to change the encoding used to decode the source code to something else, using the coding declaration. This way you can visibly tell the difference between ASCII and bytes.
Save this to a .py file and execute it with Python 2 (it uses the print statement):
# coding: rot13
s0 = "this is a string"
s1 = o"this is a string"
s2 = h"guvf vf n fgevat"
nffreg s0 == s1 == s2
cevag s0
cevag s1
cevag s2
This source is encoded in a simple letter substitution cipher. Letters in a-z and A-Z are "rotated" by 13 places; other characters are unchanged. Since there are 26 letters in the alphabet, rotating twice is an identity transform. Note that the coding declaration itself is not rotated; see PEP 263 if you want to understand why.
nffreg is an assert statement, saying that these three strings all compare equal.
cevag is a print statement.
s2 is a unicode string with a rotated u prefix. The other two are bytestrings.
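The rotated tokens can be checked by hand. The following is a small sketch in Python 3, where rot13 is still available as a text-to-text codec through the codecs module; the helper name rot13 and the token list are illustrative choices, not part of the original program.

```python
import codecs


def rot13(text):
    """Rotate a-z and A-Z by 13 places; other characters are unchanged."""
    return codecs.encode(text, "rot_13")


# Tokens taken from the rotated source file.
tokens = ["nffreg", "cevag", "o", "h", "guvf vf n fgevat"]

# Map each rotated token to what the interpreter sees after decoding.
decoded = {token: rot13(token) for token in tokens}

for token in tokens:
    print(repr(token), "->", repr(decoded[token]))

# Rotating twice is the identity transform, since 13 + 13 = 26.
assert rot13(rot13("this is a string")) == "this is a string"
```

The o and h prefixes decode to b and u, which is why s1 is a bytestring and s2 is a unicode string.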
Now, let's change the handling of the first string by introducing the unicode_literals __future__ import. Note that this future statement itself must be rotated, or you'll get a syntax error. This alters the way the tokenizer/compiler combo will process the first object, as will become evident:
# coding: rot13
sebz __shgher__ vzcbeg havpbqr_yvgrenyf
s0 = "guvf vf n fgevat"
s1 = o"this is a string"
s2 = h"guvf vf n fgevat"
nffreg s0 == s1 == s2
cevag s0
cevag s1
cevag s2
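We needed to change the text from "this is a string" into "guvf vf n fgevat" in order for the assert statement to remain valid. With unicode_literals in effect, s0 is a unicode string and its contents are decoded with the source encoding, just like s2. Without it, s0 was left untouched by the coding declaration, which shows that the first string does not have an encoding.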
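The difference between the two programs can be modelled roughly as follows. This is a simplified sketch in Python 3, not a description of how the Python 2 tokenizer is actually implemented; the function names as_byte_literal and as_unicode_literal are hypothetical and exist only for illustration.

```python
import codecs

# The literal's contents exactly as they sit in the rot13-encoded file.
file_bytes = b"guvf vf n fgevat"


def as_byte_literal(raw):
    # Simplified model of a bytestring literal: the bytes from the file
    # are kept as they are, so the coding declaration has no effect.
    return raw


def as_unicode_literal(raw, source_encoding="rot_13"):
    # Simplified model of a unicode literal: the contents are decoded
    # with the source encoding named in the coding declaration.
    return codecs.decode(raw.decode("ascii"), source_encoding)


print(as_byte_literal(file_bytes))
print(as_unicode_literal(file_bytes))
```

Under this model, the same bytes in the file yield the rotated text when read as a bytestring and the plain text when read as a unicode string, which matches what the assert statements in the two programs require.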
... unsigned char. The second as UCS2 in Python 2 (so a Unicode code point can be represented by one or two Python u characters). On Python 3, it can be one of ASCII, UTF16 or UTF32 (selected dynamically), so just ignore how Python encodes characters internally. – Giacomo Catenazzi