1
votes

I have a byte string that I'm decoding to Unicode in Python using .decode('unicode-escape'). This returns a Unicode string. Encoding that string to get it back in byte form, however, returns a different byte string. Why is this, and how can I decode and encode in a way that preserves the original data?

Examples:

some_bytes = b'7Q\x82\xacqo\xbb\x0f\x03\x105\x93<\xebD\xbe\xde\xad\x82\xf9\xa6\x1cX\x01N\x8c\xff\x9e\x84\x1e\xa1\x97'

some_bytes.decode('unicode-escape')

yields: 7Q¬qo»5<ëD¾Þ­ù¦XNÿ¡

some_bytes.decode('unicode-escape').encode()

yields: b'7Q\xc2\x82\xc2\xacqo\xc2\xbb\x0f\x03\x105\xc2\x93<\xc3\xabD\xc2\xbe\xc3\x9e\xc2\xad\xc2\x82\xc3\xb9\xc2\xa6\x1cX\x01N\xc2\x8c\xc3\xbf\xc2\x9e\xc2\x84\x1e\xc2\xa1\xc2\x97'

1
That’s… not what unicode-escape does; it’s for expressing a character string in a particular, old variety of Python literal. - Davis Herring
Oh, my mistake. Encoding with 'unicode-escape' again returns the original string. How can I properly decode bytes to unicode? - muke
You need to know what encoding it is in, and use that one. (Those bytes don’t look like any human language, so guessing it would be hard.) - Davis Herring

1 Answer

0
votes

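\xc2 and \xc3 are UTF-8 lead bytes. UTF-8 encodes every code point from U+0080 to U+00FF as two bytes, starting with \xc2 (for U+0080 to U+00BF) or \xc3 (for U+00C0 to U+00FF). For example, superscript two (U+00B2) is \xc2\xb2 in UTF-8.

Decoding with 'unicode-escape' turns each non-ASCII byte into the code point with the same value, and .encode() uses UTF-8 by default. So when you encode, each of those code points becomes two bytes, while ASCII characters stay as one byte. In the \xc3 range the second byte also changes, e.g. \xeb becomes \xc3\xab.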
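A minimal sketch of that effect, using a shortened sample built from bytes in the question (the variable names are only illustrative):

```python
# A shortened sample using bytes from the question's byte string.
some_bytes = b'7Q\x82\xac\xeb'

# 'unicode-escape' treats non-escape bytes as Latin-1, so each byte
# becomes the code point with the same numeric value.
text = some_bytes.decode('unicode-escape')
code_points = [hex(ord(ch)) for ch in text]

# str.encode() defaults to UTF-8, which needs two bytes for U+0080..U+00FF.
utf8_bytes = text.encode()

print(code_points)  # ['0x37', '0x51', '0x82', '0xac', '0xeb']
print(utf8_bytes)   # b'7Q\xc2\x82\xc2\xac\xc3\xab'
```

The ASCII bytes `7` and `Q` pass through unchanged; only the three bytes at or above 0x80 grow to two bytes each.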

For more details, see the UTF-8 table at the link below:

https://www.utf8-chartable.de/unicode-utf8-table.pl?start=128&number=128&utf8=string-literal&unicodeinhtml=hex
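As for preserving the original data: as the comments say, the right fix is to decode with the encoding the bytes were actually written in. If that is unknown and the only goal is a lossless bytes-to-str-to-bytes round trip, one option (an assumption about what is wanted here, not a way to recover meaningful text) is to use 'latin-1' in both directions, since it maps every byte value to exactly one code point:

```python
# Every possible byte value, to show the round trip holds for arbitrary data.
some_bytes = bytes(range(256))

# latin-1 maps byte N to code point U+00NN, so decoding never fails.
text = some_bytes.decode('latin-1')

# Encoding with the same codec restores the original bytes exactly.
round_tripped = text.encode('latin-1')

print(round_tripped == some_bytes)  # True
```

The key point is to encode with the same codec used for decoding; calling .encode() with no argument switches to UTF-8 and changes the bytes.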