2
votes

I have a string like this from Wikipedia (https://en.wikipedia.org/wiki/Tyre,_Lebanon)

Tyre (Arabic: صور‎‎, Ṣūr; Phoenician: 𐤑𐤅𐤓, Ṣur; Hebrew: צוֹר‎, Tsor; Tiberian Hebrew צֹר‎, Ṣōr; Akkadian: 𒀫𒊒, Ṣurru; Greek: Τύρος, Týros; Turkish: Sur; Latin: Tyrus, Armenian Տիր [Dir]), sometimes romanized as Sour, is a city in the South Governorate of Lebanon.

When this sentence is loaded from a file, its length is 262. When it is copied and pasted from the browser, it is 267.

My problem is that I have an existing C# data pipeline that reports the length as 266 (close to the copy-and-paste length above, using C#'s default read-from-file behavior), while Python 3 reads the same C# text output file and reports a length of 262. Because of this, character indexing (e.g. s[10:20]) differs between the two systems, which makes the end-to-end algorithm fail on cases like this one.

It appears the underlying encoding is different, though the two versions look identical to human readers (only the differing parts are shown):

  • Phoenician: \xf0\x90\xa4\x91\xf0\x90\xa4\x85\xf0\x90\xa4\x93, \xe1\xb9\xa2ur;
  • Phoenician: \xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd, \xe1\xb9\xa2ur;

And

  • Akkadian: \xf0\x92\x80\xab\xf0\x92\x8a\x92, \xe1\xb9\xa2urru;
  • Akkadian: \xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd, \xe1\xb9\xa2urru;

Is there a way for Python to read the file using the latter encoding, so that the length comes out as 266? And how can I detect/determine the proper encoding from the UTF-8 bytes above?

The full UTF-8 encoding for each case is shown below for further investigation.

From file

b'Tyre (Arabic: \xd8\xb5\xd9\x88\xd8\xb1\xe2\x80\x8e\xe2\x80\x8e, \xe1\xb9\xa2\xc5\xabr; Phoenician: \xf0\x90\xa4\x91\xf0\x90\xa4\x85\xf0\x90\xa4\x93, \xe1\xb9\xa2ur; Hebrew: \xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e, Tsor; Tiberian Hebrew \xd7\xa6\xd6\xb9\xd7\xa8\xe2\x80\x8e, \xe1\xb9\xa2\xc5\x8dr; Akkadian: \xf0\x92\x80\xab\xf0\x92\x8a\x92, \xe1\xb9\xa2urru; Greek: \xce\xa4\xcf\x8d\xcf\x81\xce\xbf\xcf\x82, T\xc3\xbdros; Turkish: Sur; Latin: Tyrus, Armenian \xd5\x8f\xd5\xab\xd6\x80 [Dir]), sometimes romanized as Sour, is a city in the South Governorate of Lebanon.'

From copy and paste

b'Tyre (Arabic: \xd8\xb5\xd9\x88\xd8\xb1\xe2\x80\x8e\xe2\x80\x8e, \xe1\xb9\xa2\xc5\xabr; Phoenician: \xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd, \xe1\xb9\xa2ur; Hebrew: \xd7\xa6\xd7\x95\xd6\xb9\xd7\xa8\xe2\x80\x8e, Tsor; Tiberian Hebrew \xd7\xa6\xd6\xb9\xd7\xa8\xe2\x80\x8e, \xe1\xb9\xa2\xc5\x8dr; Akkadian: \xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd\xef\xbf\xbd, \xe1\xb9\xa2urru; Greek: \xce\xa4\xcf\x8d\xcf\x81\xce\xbf\xcf\x82, T\xc3\xbdros; Turkish: Sur; Latin: Tyrus, Armenian \xd5\x8f\xd5\xab\xd6\x80 [Dir]), sometimes romanized as Sour, is a city in the South Governorate of Lebanon.'
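For reference, the discrepancy can be reproduced in Python 3 from the Phoenician fragment alone (the byte sequences are taken from the dumps above):

```python
# The Phoenician bytes as read from the file: three 4-byte UTF-8 sequences.
from_file = b'\xf0\x90\xa4\x91\xf0\x90\xa4\x85\xf0\x90\xa4\x93'

# The same span after copy-and-paste: six copies of the UTF-8 encoding
# of U+FFFD, the replacement character.
from_paste = b'\xef\xbf\xbd' * 6

print(len(from_file.decode('utf-8')))   # 3 code points: '𐤑𐤅𐤓'
print(len(from_paste.decode('utf-8')))  # 6 code points: '������'
```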

2
My guess is that Python and Javascript have different ideas about which codepoints get combined into a single character. - Beefster
b'\xef\xbf\xbd' (which you have repeated a few times in these two examples) is the UTF-8 encoding of the replacement character ("�"). This character is often used for displaying invalid bytes in UTF-8, but apparently your browser substitutes some characters it doesn't want to deal with (for whatever reason). I'm pretty sure the version with the replacement characters is broken; now you need to find out why this happens in your C# pipeline. - lenz
@Beefster As long as there is a way to enforce the encoding, it should be consistent. Is there any way to do that? - Yo Hsiao
@Beefster Javascript? Why Javascript? - lenz
@lenz or just the browser. Either way. - Beefster

2 Answers

1
votes

You probably don't have Phoenician fonts installed on your system, so the web browser (as @lenz mentioned in the comments) displays replacement characters (�) instead. Python loads your string properly.

There are 5 problematic characters in the text: 3 Phoenician and 2 Akkadian. The Phoenician ones (decoded from your byte dump) are:

  • 𐤑 U+10911 PHOENICIAN LETTER SADE
  • 𐤅 U+10905 PHOENICIAN LETTER WAW
  • 𐤓 U+10913 PHOENICIAN LETTER ROSH

(I omit the Akkadian ones.)

Each of those letters is replaced in your second encoding by \xef\xbf\xbd\xef\xbf\xbd, which corresponds to ��.

Each problematic letter is thus replaced by two signs, so the total length of the string increases by 5, from 262 to 267 characters.
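The arithmetic can be checked directly in Python; the five code points below are my decoding of the byte dump in the question. Each is a single code point to Python, but lies outside the BMP and therefore takes two UTF-16 code units, which is why the count grows by exactly 5:

```python
# The five astral (non-BMP) code points from the question's file bytes.
astral = [
    '\U00010911',  # Phoenician sade
    '\U00010905',  # Phoenician waw
    '\U00010913',  # Phoenician rosh
    '\U0001202B',  # Akkadian cuneiform sign
    '\U00012292',  # Akkadian cuneiform sign
]

for ch in astral:
    # One code point for Python...
    assert len(ch) == 1
    # ...but two UTF-16 code units, hence one extra unit each in C#'s count.
    assert len(ch.encode('utf-16-le')) // 2 == 2
```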

0
votes

It turns out there is a different way to look at this question. C# does report a longer length for the string, but that does not mean it is incorrect; the underlying encoding system is simply different and has its limitations.

http://illegalargumentexception.blogspot.com/2010/04/i18n-comparing-character-encoding-in-c.html

Python C# - Unicode character is not the same on Python and C#

When reading a file and decoding it to Unicode, C# and Java store strings internally encoded as UTF-16. Code points outside the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) are represented by surrogate pairs (two 16-bit code units), so a single code point counts as two. The fact that you can see a Unicode code point as two units is a leaky abstraction.

Python 3.3+ hides this abstraction. It internally uses a 1-, 2-, or 4-byte representation per character as needed, but presents only Unicode code points to the user.

This explains why the lengths reported by C# can be longer than those reported by Python.

How to make them congruent? Hmm... probably not directly, but perhaps through a substring search as a post-processing step...
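Alternatively, the indices can be translated on the Python side by counting UTF-16 code units. A sketch (the function names are my own, not from any library):

```python
def utf16_len(s: str) -> int:
    """Length in UTF-16 code units, i.e. what C#'s string.Length reports."""
    return len(s.encode('utf-16-le')) // 2

def utf16_index_to_py(s: str, i16: int) -> int:
    """Map a C#-style UTF-16 code-unit index to a Python code-point index."""
    units = 0
    for i, ch in enumerate(s):
        if units >= i16:
            return i
        # Astral code points occupy two UTF-16 code units (a surrogate pair).
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(s)

s = 'a\U00010911b'              # 'a', one astral letter, 'b'
print(len(s))                   # 3 code points in Python
print(utf16_len(s))             # 4 code units in C#
print(utf16_index_to_py(s, 3))  # C# index 3 ('b') -> Python index 2
```

With such a mapping, slice bounds produced by the C# pipeline can be converted before applying s[start:end] in Python.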