2
votes

I have a string like this from Wikipedia (https://en.wikipedia.org/wiki/Tyre,_Lebanon)

Tyre (Arabic: صور‎‎, Ṣūr; Phoenician: 𐤑𐤅𐤓, Ṣur; Hebrew: צוֹר‎, Tsor; Tiberian Hebrew צֹר‎, Ṣōr; Akkadian: 𒀫𒊒, Ṣurru; Greek: Τύρος, Týros; Turkish: Sur; Latin: Tyrus, Armenian Տիր [Dir]), sometimes romanized as Sour, is a city in the South Governorate of Lebanon.

When this sentence is loaded from a file, its length is 262. When it is copied and pasted from the browser, it is 267.

My problem is that I have an existing C# data pipeline that reports the length as 266 (close to the copy-and-paste length above, using C#'s default read-from-file behavior), while Python 3 reads the same C# text output file and reports a length of 262. Because of this, character indexing (e.g. s[10:20]) differs between the two systems, which makes the end-to-end algorithm fail on cases like this one.

It appears the underlying encoding is different, though the two versions look identical to human readers (only the differing parts are shown):

  • Phoenician: \xf0\x90\xa4\x91\xf0\x90\xa4\x85\xf0\x90\xa4\x93, \xe1\xb9\xa2ur;
  • Phoenician: \xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd, \xe1\xb9\xa2ur;

And

  • Akkadian: \xf0\x92\x80\xab\xf0\x92\x8a\x92, \xe1\xb9\xa2urru;
  • Akkadian: \xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd, \xe1\xb9\xa2urru;

Is there a way for Python to read the file using the latter encoding, so that the length comes out as 266? And how can I detect/determine the proper encoding from the UTF-8 bytes above?

The full UTF-8 encoding for each case is shown below for further investigation.

From file

b'Tyre (Arabic: \xd8\xb5\xd9\x88\xd8\xb1\xe2\x80\x8e\xe2\x80\x8e, \xe1\xb9\xa2\xc5\xabr; Phoenician: \xf0\x90\xa4\x91\xf0\x90\xa4\x85\xf0\x90\xa4\x93, \xe1\xb9\xa2ur; Hebrew: \xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e, Tsor; Tiberian Hebrew \xd7\xa6\xd6\xb9\xd7\xa8\xe2\x80\x8e, \xe1\xb9\xa2\xc5\x8dr; Akkadian: \xf0\x92\x80\xab\xf0\x92\x8a\x92, \xe1\xb9\xa2urru; Greek: \xce\xa4\xcf\x8d\xcf\x81\xce\xbf\xcf\x82, T\xc3\xbdros; Turkish: Sur; Latin: Tyrus, Armenian \xd5\x8f\xd5\xab\xd6\x80 [Dir]), sometimes romanized as Sour, is a city in the South Governorate of Lebanon.'

From copy and paste

b'Tyre (Arabic: \xd8\xb5\xd9\x88\xd8\xb1\xe2\x80\x8e\xe2\x80\x8e, \xe1\xb9\xa2\xc5\xabr; Phoenician: \xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd, \xe1\xb9\xa2ur; Hebrew: \xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e, Tsor; Tiberian Hebrew \xd7\xa6\xd6\xb9\xd7\xa8\xe2\x80\x8e, \xe1\xb9\xa2\xc5\x8dr; Akkadian: \xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd, \xe1\xb9\xa2urru; Greek: \xce\xa4\xcf\x8d\xcf\x81\xce\xbf\xcf\x82, T\xc3\xbdros; Turkish: Sur; Latin: Tyrus, Armenian \xd5\x8f\xd5\xab\xd6\x80 [Dir]), sometimes romanized as Sour, is a city in the South Governorate of Lebanon.'
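For reference, the discrepancy can be reproduced in Python 3 from the Phoenician fragment alone (the byte sequences are taken from the dumps above):

```python
# The Phoenician bytes as read from the file: three 4-byte UTF-8 sequences.
from_file = b'\xf0\x90\xa4\x91\xf0\x90\xa4\x85\xf0\x90\xa4\x93'

# The same span after copy-and-paste: six copies of the UTF-8 encoding
# of U+FFFD, the replacement character.
from_paste = b'\xef\xbf\xbd' * 6

print(len(from_file.decode('utf-8')))   # 3 code points: '𐤑𐤅𐤓'
print(len(from_paste.decode('utf-8')))  # 6 code points: '������'
```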

2
My guess is that Python and Javascript have different ideas about which codepoints get combined into a single character. - Beefster
b'\xef\xbf\xbd' (which you have repeated a few times in these two examples) is the UTF-8 encoding of the replacement character ("�"). This character is often used for displaying invalid bytes in UTF-8, but apparently your browser substitutes some characters it doesn't want to deal with (for whatever reason). I'm pretty sure the version with the replacement characters is broken; now you need to find out why this happens in your C# pipeline. - lenz
@Beefster As long as there is a way to enforce the encoding, it should be consistent. Is there any way to do that? - Yo Hsiao
@Beefster Javascript? Why Javascript? - lenz
@lenz or just the browser. Either way. - Beefster

2 Answers

1
votes

You probably don't have Phoenician fonts installed on your system, so the web browser (as @lenz mentioned in the comments) displays replacement characters (�) instead. Python loads your string properly.

There are 5 problematic characters in the text: 3 Phoenician and 2 Akkadian. The Phoenician ones (decoded from your byte dump) are:

  • 𐤑 U+10911 PHOENICIAN LETTER SADE
  • 𐤅 U+10905 PHOENICIAN LETTER WAW
  • 𐤓 U+10913 PHOENICIAN LETTER ROSH

(I omit the Akkadian ones.)

Each of those letters is replaced in your second encoding by \xef\xbf\xbd\xef\xbf\xbd, which corresponds to ��.

Each problematic letter is thus replaced by two signs, so the total length of the string increases by 5, from 262 to 267 characters.
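The arithmetic can be checked directly in Python; the five code points below are my decoding of the byte dump in the question. Each is a single code point to Python, but lies outside the BMP and therefore takes two UTF-16 code units, which is why the count grows by exactly 5:

```python
# The five astral (non-BMP) code points from the question's file bytes.
astral = [
    '\U00010911',  # Phoenician sade
    '\U00010905',  # Phoenician waw
    '\U00010913',  # Phoenician rosh
    '\U0001202B',  # Akkadian cuneiform sign
    '\U00012292',  # Akkadian cuneiform sign
]

for ch in astral:
    # One code point for Python...
    assert len(ch) == 1
    # ...but two UTF-16 code units, hence one extra unit each in C#'s count.
    assert len(ch.encode('utf-16-le')) // 2 == 2
```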

0
votes

It turns out there is a different way to look at this question. C# does report a longer length for the string, but that does not mean it is incorrect; the underlying encoding system is simply different and has its limitations.

http://illegalargumentexception.blogspot.com/2010/04/i18n-comparing-character-encoding-in-c.html

Python C# - Unicode character is not the same on Python and C#

When reading a file and decoding it to Unicode, C# and Java store strings internally encoded as UTF-16. Code points outside the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) are represented by surrogate pairs (two 16-bit code units), so a single code point counts as two. The fact that you can see a Unicode code point as two units is a leaky abstraction.

Python 3.3+ hides this abstraction. It internally uses a 1-, 2-, or 4-byte representation per character as needed, but presents only Unicode code points to the user.

This explains why the lengths reported by C# can be longer than those reported by Python.

How to make them congruent? Hmm... probably not directly, but perhaps through a substring search as a post-processing step...
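Alternatively, the indices can be translated on the Python side by counting UTF-16 code units. A sketch (the function names are my own, not from any library):

```python
def utf16_len(s: str) -> int:
    """Length in UTF-16 code units, i.e. what C#'s string.Length reports."""
    return len(s.encode('utf-16-le')) // 2

def utf16_index_to_py(s: str, i16: int) -> int:
    """Map a C#-style UTF-16 code-unit index to a Python code-point index."""
    units = 0
    for i, ch in enumerate(s):
        if units >= i16:
            return i
        # Astral code points occupy two UTF-16 code units (a surrogate pair).
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(s)

s = 'a\U00010911b'              # 'a', one astral letter, 'b'
print(len(s))                   # 3 code points in Python
print(utf16_len(s))             # 4 code units in C#
print(utf16_index_to_py(s, 3))  # C# index 3 ('b') -> Python index 2
```

With such a mapping, slice bounds produced by the C# pipeline can be converted before applying s[start:end] in Python.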