I have a string like this from Wikipedia (https://en.wikipedia.org/wiki/Tyre,_Lebanon):
Tyre (Arabic: صور, Ṣūr; Phoenician: 𐤑𐤅𐤓, Ṣur; Hebrew: צוֹר, Tsor; Tiberian Hebrew צֹר, Ṣōr; Akkadian: 𒀫𒊒, Ṣurru; Greek: Τύρος, Týros; Turkish: Sur; Latin: Tyrus, Armenian Տիր [Dir]), sometimes romanized as Sour, is a city in the South Governorate of Lebanon.
When this sentence is loaded from a file, its length is 262. When it is copied and pasted from a browser, it is 267.
My problem is that I have an existing data pipeline in C# that reports the length as 267 (the copy-and-paste length above, which is also what C# gets by default when reading from a file), but Python 3 reads the C# text output file and reports a length of 262. As a result, character indexing (e.g. s[10:20]) differs between the two systems, and the end-to-end algorithm fails in cases like this.
The underlying bytes appear to differ, even though both versions look the same to a human reader (only the differing parts are shown; in each pair, the first line is from the file and the second is from copy and paste):
- Phoenician: \xf0\x90\xa4\x91\xf0\x90\xa4\x85\xf0\x90\xa4\x93, \xe1\xb9\xa2ur;
- Phoenician: \xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd, \xe1\xb9\xa2ur;
And
- Akkadian: \xf0\x92\x80\xab\xf0\x92\x8a\x92, \xe1\xb9\xa2urru;
- Akkadian: \xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd, \xe1\xb9\xa2urru;
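To make the difference concrete, the two Phoenician fragments can be decoded and listed code point by code point. This is only a minimal sketch: the helper name `describe` is made up for illustration, and the byte strings are copied from the list above.

```python
import unicodedata

# Byte fragments copied from the Phoenician pair above.
from_file = b"\xf0\x90\xa4\x91\xf0\x90\xa4\x85\xf0\x90\xa4\x93"
from_paste = b"\xef\xbf\xbd" * 6


def describe(raw):
    """Decode UTF-8 bytes and label every resulting code point."""
    text = raw.decode("utf-8")
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]


file_points = describe(from_file)
paste_points = describe(from_paste)

print(len(file_points), file_points)    # three code points above U+FFFF
print(len(paste_points), paste_points)  # six copies of U+FFFD
```

The file version decodes to three code points outside the Basic Multilingual Plane, while the pasted version decodes to six U+FFFD characters, so the two byte sequences are not two encodings of the same text.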
Is there a way for Python to read the file using the latter representation, so that the length matches the one C# reports? And how can I detect or determine the proper encoding from the UTF-8 bytes above?
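One possible explanation, stated here as an assumption rather than something confirmed for this pipeline: .NET strings are sequences of UTF-16 code units and `string.Length` counts those units, whereas `len()` in Python 3 counts code points. Characters outside the Basic Multilingual Plane, such as the Phoenician and cuneiform letters, take two UTF-16 code units each. If that is what the C# side is counting, the sketch below shows how Python could reproduce the same lengths and offsets. The helper names `utf16_len` and `utf16_slice` are hypothetical.

```python
def utf16_len(text):
    """Count UTF-16 code units (what C#'s string.Length counts)."""
    return len(text.encode("utf-16-le")) // 2


def utf16_slice(text, start, end):
    """Slice with UTF-16 code-unit offsets instead of code-point offsets."""
    units = text.encode("utf-16-le")  # two bytes per code unit
    # errors="replace" guards against offsets that cut a surrogate pair in half
    return units[2 * start:2 * end].decode("utf-16-le", errors="replace")


# Shortened sample using the three Phoenician letters from the file bytes.
sample = "Phoenician: \U00010911\U00010905\U00010913, \u1e62ur"

print(len(sample))        # code points, as Python counts them
print(utf16_len(sample))  # UTF-16 code units, as C# would count them
print(utf16_slice(sample, 20, 23))
```

For this sample the code-point length is 20 and the UTF-16 length is 23, because each of the three Phoenician letters needs a surrogate pair. Converting offsets at the boundary between the two systems is probably safer than changing how either side stores its strings.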
The full UTF-8 bytes for each case are shown below for further investigation.
From file:
b'Tyre (Arabic: \xd8\xb5\xd9\x88\xd8\xb1\xe2\x80\x8e\xe2\x80\x8e, \xe1\xb9\xa2\xc5\xabr; Phoenician: \xf0\x90\xa4\x91\xf0\x90\xa4\x85\xf0\x90\xa4\x93, \xe1\xb9\xa2ur; Hebrew: \xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e, Tsor; Tiberian Hebrew \xd7\xa6\xd6\xb9\xd7\xa8\xe2\x80\x8e, \xe1\xb9\xa2\xc5\x8dr; Akkadian: \xf0\x92\x80\xab\xf0\x92\x8a\x92, \xe1\xb9\xa2urru; Greek: \xce\xa4\xcf\x8d\xcf\x81\xce\xbf\xcf\x82, T\xc3\xbdros; Turkish: Sur; Latin: Tyrus, Armenian \xd5\x8f\xd5\xab\xd6\x80 [Dir]), sometimes romanized as Sour, is a city in the South Governorate of Lebanon.'
From copy and paste:
b'Tyre (Arabic: \xd8\xb5\xd9\x88\xd8\xb1\xe2\x80\x8e\xe2\x80\x8e, \xe1\xb9\xa2\xc5\xabr; Phoenician: \xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd, \xe1\xb9\xa2ur; Hebrew: \xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e, Tsor; Tiberian Hebrew \xd7\xa6\xd6\xb9\xd7\xa8\xe2\x80\x8e, \xe1\xb9\xa2\xc5\x8dr; Akkadian: \xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd, \xe1\xb9\xa2urru; Greek: \xce\xa4\xcf\x8d\xcf\x81\xce\xbf\xcf\x82, T\xc3\xbdros; Turkish: Sur; Latin: Tyrus, Armenian \xd5\x8f\xd5\xab\xd6\x80 [Dir]), sometimes romanized as Sour, is a city in the South Governorate of Lebanon.'
b'\xef\xbf\xbd' (which is repeated a few times in these two examples) is the UTF-8 encoding of the replacement character ("�"). This character is often used to display bytes that are invalid in UTF-8, but apparently your browser substitutes it for some characters it doesn't want to deal with (for whatever reason). I'm pretty sure the version with the replacement characters is broken; now you need to find out why this happens in your C# pipeline. - lenz
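Following the comment above, a quick way to check whether a piece of text has already lost information is to scan it for U+FFFD. This is a minimal sketch; `replacement_positions` is a made-up helper name, and the two byte strings are the Akkadian fragments from the question.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def replacement_positions(text):
    """Return the code-point indices at which U+FFFD occurs."""
    return [i for i, ch in enumerate(text) if ch == REPLACEMENT]


pasted = (
    b"Akkadian: \xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd, "
    b"\xe1\xb9\xa2urru;"
).decode("utf-8")
original = (
    b"Akkadian: \xf0\x92\x80\xab\xf0\x92\x8a\x92, \xe1\xb9\xa2urru;"
).decode("utf-8")

print(replacement_positions(pasted))    # [10, 11, 12, 13]
print(replacement_positions(original))  # []
```

A non-empty result means the original characters were replaced somewhere upstream and cannot be recovered from that text alone; the fix has to happen at the step that produced the replacement characters.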