Obtaining the original bytes after decoding to unicode and back

Question

I have a byte string that I decode to unicode in Python using .decode('unicode-escape'). This returns a unicode string. However, encoding that unicode string to get bytes again returns a different byte string. Why is this, and how can I decode and encode in a way that preserves the original data?

Examples:

some_bytes = b'7Q\x82\xacqo\xbb\x0f\x03\x105\x93<\xebD\xbe\xde\xad\x82\xf9\xa6\x1cX\x01N\x8c\xff\x9e\x84\x1e\xa1\x97'

some_bytes.decode('unicode-escape')

yields: 7Q¬qo»5<ëD¾Þù¦XNÿ¡

some_bytes.decode('unicode-escape').encode()

yields: b'7Q\xc2\x82\xc2\xacqo\xc2\xbb\x0f\x03\x105\xc2\x93<\xc3\xabD\xc2\xbe\xc3\x9e\xc2\xad\xc2\x82\xc3\xb9\xc2\xa6\x1cX\x01N\xc2\x8c\xc3\xbf\xc2\x9e\xc2\x84\x1e\xc2\xa1\xc2\x97'

That’s… not what unicode-escape does; it’s for expressing a character string in a particular, old variety of Python literal. — Davis Herring
Oh, my mistake. Encoding with 'unicode-escape' again returns the original string. How can I properly decode bytes to unicode? — muke
You need to know what encoding it is in, and use that one. (Those bytes don’t look like any human language, so guessing it would be hard.) — Davis Herring

amol goel · Accepted Answer · 2019-09-15T12:32:28

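\xc2 and \xc3 are the UTF-8 lead bytes for code points U+0080 to U+00FF. For example, superscript two (U+00B2) is \xc2\xb2 in UTF-8.

Decoding with 'unicode-escape' turns every byte that is not part of a backslash escape into the code point with the same value (as Latin-1 does), and .encode() with no argument uses UTF-8. So each byte in your data at or above \x80 comes back as a two-byte sequence beginning with \xc2 or \xc3, while bytes below \x80 are unchanged.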
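To make the \xc2/\xc3 prefix concrete, here is a minimal sketch. It uses a shortened version of the bytes from the question; the variable names are only illustrative.

```python
# Code points U+0080..U+00FF need two bytes in UTF-8. The lead byte is
# 0xC2 for U+0080..U+00BF and 0xC3 for U+00C0..U+00FF.
superscript_two = '\u00b2'
print(superscript_two.encode('utf-8'))   # b'\xc2\xb2'

e_diaeresis = '\u00eb'
print(e_diaeresis.encode('utf-8'))       # b'\xc3\xab'

# Shortened sample of the bytes in the question (no backslash bytes in it).
some_bytes = b'7Q\x82\xac'

# With no backslash escapes present, unicode-escape maps each byte to the
# code point of the same value.
text = some_bytes.decode('unicode-escape')
print([hex(ord(ch)) for ch in text])     # ['0x37', '0x51', '0x82', '0xac']

# str.encode() defaults to UTF-8, so the two code points above U+007F
# each become two bytes.
print(text.encode())                     # b'7Q\xc2\x82\xc2\xac'
```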

For more details, see the table at the link below:

https://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&number=128&utf8=string-literal&unicodeinhtml=hex
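As for preserving the original data: a minimal sketch, assuming the bytes are arbitrary binary data rather than text in a known encoding. Latin-1 maps every byte value to the code point with the same number, so decoding and encoding with it is lossless. Whether a str is the right container for such data at all is a separate question; if the bytes are really text, the comments above apply and the actual encoding should be used instead.

```python
# Shortened sample of the bytes in the question.
some_bytes = b'7Q\x82\xacqo\xbb\x0f\x03\x105\x93<\xebD\xbe'

# latin-1 maps bytes 0x00..0xFF one-to-one onto U+0000..U+00FF,
# so the round trip gives back exactly the same bytes.
as_text = some_bytes.decode('latin-1')
round_tripped = as_text.encode('latin-1')
print(round_tripped == some_bytes)       # True

# unicode-escape followed by latin-1 also matches here, but only because
# this sample contains no backslash bytes.
via_escape = some_bytes.decode('unicode-escape').encode('latin-1')
print(via_escape == some_bytes)          # True

# With a backslash in the data, unicode-escape interprets the escape
# sequence and the original bytes are lost.
tricky = b'a\\nb'                        # four bytes: a, backslash, n, b
lossy = tricky.decode('unicode-escape').encode('latin-1')
print(lossy)                             # b'a\nb' (three bytes)
```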
