I'm trying to get the first character of a byte string in Python 3.4, but when I index it, I get an int:

>>> my_bytes = b'just a byte string'
>>> my_bytes
b'just a byte string'
>>> my_bytes[0]
106
>>> type(my_bytes[0])
<class 'int'>

This seems unintuitive to me, as I was expecting to get b'j'.

I have discovered that slicing gives me the value I expect, but it feels like a hack.

>>> my_bytes[0:1]
b'j'

Can someone please explain why this happens?

The hack of using a range like my_bytes[0:1] really helped me write Python2/Python3 compatible code. I'd love to see an answer that covers the best practice for compatible code addressing this issue. For example: ord(my_bytes[0]) gives an int in Python2, yet my_bytes[0] gives an int in Python3. To work in both, I'm using ord(my_bytes[0:1]) which seems really ugly for Python3. - proximous
Your answer helped me. I couldn't find the best approach to work with bytes and avoid the integer conversion when accessing an index, thanks. - Bersan
I noticed the same phenomenon with lists made from bytearray and bytes objects. type(list(b'abctest').pop(0)) gives <class 'int'>. type(list(bytearray(b'abctest')).pop(0)) gives <class 'int'>. type(bytearray(b'abctest').pop(0)) gives <class 'int'>. - Valentin Stoykov

1 Answer

The bytes type is a Binary Sequence type, and is explicitly documented as containing a sequence of integers in the range 0 to 255.

From the documentation:

Bytes objects are immutable sequences of single bytes.

[...]

While bytes literals and representations are based on ASCII text, bytes objects actually behave like immutable sequences of integers, with each value in the sequence restricted such that 0 <= x < 256[.]

[...]

Since bytes objects are sequences of integers (akin to a tuple), for a bytes object b, b[0] will be an integer, while b[0:1] will be a bytes object of length 1. (This contrasts with text strings, where both indexing and slicing will produce a string of length 1).

Note that indexing a string is a bit of an exception among the sequence types: 'abc'[0] gives you a str object of length one. str is the only built-in sequence type whose elements are always of its own type.
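To make the difference between indexing and slicing concrete, here is a minimal sketch. The variable names are only illustrative (my_bytes mirrors the question), and wrapping the integer in a list is just one way to get back to a bytes object of length 1:

```python
# Illustrative values only; my_bytes mirrors the name used in the question.
my_bytes = b'just a byte string'

first_as_int = my_bytes[0]        # indexing yields an int
first_as_bytes = my_bytes[0:1]    # slicing yields a bytes object of length 1

# Wrap the int in a list to build a one-byte bytes object from it.
# Note: bytes(106) would NOT do this; it creates 106 zero bytes instead.
rebuilt = bytes([first_as_int])

print(first_as_int)      # 106
print(first_as_bytes)    # b'j'
print(rebuilt)           # b'j'

# str is the exception: indexing and slicing both give a str of length 1.
text = 'just a string'
print(text[0] == text[0:1])   # True

# Iterating over bytes (or calling list() on it) also produces ints.
print(list(b'abc'))      # [97, 98, 99]
```

Slicing also has the advantage that it never raises on an empty value: b''[0:1] is simply b'', whereas b''[0] raises IndexError.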

This echoes how other languages treat string data; in C, text is modelled as a char[] array, and the unsigned char type is effectively an integer in the range 0-255 (on platforms with 8-bit chars). Whether an unqualified char is signed or unsigned is implementation-defined, so it varies by compiler and platform.
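The "small integers" view is what makes byte-level arithmetic straightforward. The sketch below is only an illustration with made-up data; it shows that the elements take part in ordinary integer arithmetic and that values outside 0 to 255 are rejected when building a bytes object:

```python
# Made-up sample data for illustration.
data = b'abc'

# Elements are plain ints, so sum() works directly: 97 + 98 + 99.
total = sum(data)
checksum = total % 256            # a simple modulo-256 checksum

# Build a new bytes object from transformed integer values.
shifted = bytes((x + 1) % 256 for x in data)

print(total)       # 294
print(checksum)    # 38
print(shifted)     # b'bcd'

# Each element must satisfy 0 <= x < 256; anything else raises ValueError.
out_of_range_rejected = False
try:
    bytes([256])
except ValueError:
    out_of_range_rejected = True

print(out_of_range_rejected)   # True
```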