Python C# - Unicode character is not the same on Python and C#

Question

I ran into a problem while working with text files: the Unicode representation of a character differs between Python and C#.

When I open the file with Python 3.5.2, the character at a specific index is:

with open('file.txt', 'r', encoding='utf-8') as f:
    text = f.read()
text[189]

# Output: u"\U0001F464"

When I open the file with C#, the same character is represented by two chars, starting at the same index:

string text = File.ReadAllText("file.txt", Encoding.UTF8);

Console.WriteLine(((int)text[189]).ToString("X4"));
// Output: "D83D"

Console.WriteLine(((int)text[190]).ToString("X4"));
// Output: "DC64"

So in Python this character is at index 189, while in C# it occupies indices 189 and 190.

Reference for this character on the fileformat.info website:

http://www.fileformat.info/info/unicode/char/1F464/index.htm

As you can see there, the representations of this character have different lengths: "\uD83D\uDC64" in C#/C/C++/Java and u"\U0001F464" in Python.

The part of the text that is problematic:

👤 Sign in

Is there a way to use the same Unicode representation in Python 3.5 and C#?

Edit:

The original file in which this error happened can be downloaded here: https://ufile.io/pr5v6

"As you can see there, the representation of this charecter has a different length. On C#/C/C++/Java 2 chars and on python 1 char" not true, I think you need to read on what UTF-8 is and then fix the problem where it actually happens, which is not while opening the file. Refer to the XY Problem to understand why I say this. — Iharob Al Asimi
You can try it yourself. I checked it enough times before I posted it. I will add the original file in a second. — Montoya
You would be surprised by the fact that "checking" code and verifying that it "works" in general doesn't really mean that it does. — Iharob Al Asimi
Related reading : The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) — Pac0

Mark Tolonen · Accepted Answer · 2017-08-30T15:41:58

You can't fix it. It is inherent in the Unicode implementation of the languages.

When a file is read and decoded, C# and Java store the resulting string internally as UTF-16. Code points outside the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) are represented by a surrogate pair, that is, two 16-bit code units. Indexing a string in these languages returns code units rather than code points, so seeing one Unicode code point as two chars is a leaky abstraction.
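As a minimal sketch of where the D83D and DC64 values in the question come from, the standard UTF-16 surrogate arithmetic can be written out by hand. The helper name to_surrogates is hypothetical and only for illustration; the result is cross-checked against Python's own UTF-16 encoder.

```python
def to_surrogates(code_point):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogate pair.

    Hypothetical helper, written only to illustrate the arithmetic.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)           # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)          # bottom 10 bits
    return high, low


high, low = to_surrogates(0x1F464)
print('%04X %04X' % (high, low))             # D83D DC64

# Cross-check against the built-in codec (big-endian, no BOM).
assert '\U0001F464'.encode('utf-16-be') == bytes([0xD8, 0x3D, 0xDC, 0x64])
```

These are the same two values that C# reports at indices 189 and 190.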

Python 3.3+ hides this detail. Internally it stores each string with 1, 2, or 4 bytes per code point, as needed, but indexing and len() expose only Unicode code points to the user.

Python 2 on a narrow build, such as the Windows build below (same leaky abstraction as C# and Java):

Python 2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:53:40) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> len(u'\U0001F464')
2
>>> u'\U0001F464'[0]
u'\ud83d'
>>> u'\U0001F464'[1]
u'\udc64'

Python 3.3+:

Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 18:41:36) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> len(u'\U0001F464')
1
>>> u'\U0001F464'[0]
'👤'

Internally, Python 3.3+ stores a string that contains any non-BMP code point with four bytes per code point (effectively UTF-32), so U+1F464 itself takes four bytes.
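Although the representations cannot be made identical, indices can be translated between the two views. The following is a rough sketch, assuming the text contains no lone surrogates; utf16_index and utf16_units are hypothetical helper names, not part of any library, and the sample string is made up for illustration.

```python
def utf16_index(text, index):
    """Return the UTF-16 code-unit offset (the index C# would use) of text[index].

    Hypothetical helper: counts 2 units for each non-BMP code point before index.
    """
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text[:index])


def utf16_units(text):
    """Return the string as a list of UTF-16 code units, as C# would index it."""
    data = text.encode('utf-16-le')
    return [int.from_bytes(data[i:i + 2], 'little')
            for i in range(0, len(data), 2)]


sample = 'ab\U0001F464 Sign in'              # made-up sample text
units = utf16_units(sample)

print(len(sample), len(units))               # 11 12
print(utf16_index(sample, 3))                # 4: the space after the emoji
print(['%04X' % u for u in units[2:4]])      # ['D83D', 'DC64']
```

In the question's file the character sits at index 189 in both languages, which suggests no non-BMP code point precedes it; every index after it would then differ by one.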
