0
votes

I'm trying to figure out how non-ASCII characters get saved in ASCII files. For example, if I open Notepad++, set the encoding to UTF-8, and then write שלום, it saves 11 bytes: 3 for the BOM and 2 for each character. (I added | before and after each byte.)

|239||187||191||215||169||215||156||215||149||215||157|

I can look up these values and figure out what letter they are referring to. E.g. http://utf8-chartable.de/unicode-utf8-table.pl?start=1408&number=128&utf8=dec
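The byte values listed above can be reproduced with a short script. This is only a sketch, assuming Python 3 is available; `utf-8-sig` is Python's codec name for UTF-8 with a leading BOM, and the variable names are made up for illustration.

```python
# The word from the question, written as escapes to avoid any ambiguity
# about the encoding of this source file: shin, lamed, vav, final mem.
word = "\u05e9\u05dc\u05d5\u05dd"

# "utf-8-sig" prepends the 3-byte BOM (239, 187, 191) before the text.
utf8_with_bom = word.encode("utf-8-sig")
print(list(utf8_with_bom))
# [239, 187, 191, 215, 169, 215, 156, 215, 149, 215, 157]

# Each Hebrew letter on its own: code point and its two UTF-8 bytes.
for ch in word:
    print(f"U+{ord(ch):04X}", list(ch.encode("utf-8")))
```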

If I open a new file, set the encoding to ASCII, and write the same word, it saves 4 bytes:

|249||236||229||237|

If I open the ASCII file, it correctly shows me the Hebrew word that I typed. How does it know? Is there a reference similar to the one for Unicode?

2
That will only work if your machine is configured with the correct system code page. - Hans Passant
Thanks Hans, I found these codes in the Windows-1255 code page that you specified, though I don't think it has to do with the system code page. Running chcp in a command prompt returns: "Active code page: 862". In any case, I was more concerned with being able to look up these byte codes, so you really answered my question. I don't think a comment can be marked as an answer. Perhaps copy your comment into an answer so that you can get proper credit. - Mordechai
Console mode apps use a legacy MS-DOS code page, like 862. Native Windows apps, like Notepad++, use the system code page. This is all ancient history that doesn't deserve to ever come back. Unicode is the norm today. - Hans Passant
@MorDeror: codepage 862 is the MS-DOS codepage used for Hebrew, but it does not map the Hebrew letters to the same byte values that Windows codepages 1255 and 28598 do. - Remy Lebeau

2 Answers

2
votes

Only Unicode characters U+0000...U+007F can be encoded in ASCII, and trivially so: each one becomes a single byte with the same value.

Notepad++ does not have ASCII as an encoding. Instead, it has “ANSI”, which is a misnomer for a collection of encodings, typically 8-bit encodings. Simply do not use them. Use UTF-8 instead.

What happens in your case is probably that in your environment, “ANSI” is taken as an 8-bit Latin/Hebrew encoding, where code numbers outside the ASCII range denote Hebrew letters. This works up to a point, but not across systems and programs.
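A small sketch of why this does not carry across systems, assuming Python 3 and assuming that the “ANSI” code page on the asker's machine is Windows-1255 (codec name `cp1255`). The same four bytes are decoded twice: once with the Hebrew code page and once with the Western European one (`cp1252`), as a differently configured machine might do.

```python
word = "\u05e9\u05dc\u05d5\u05dd"  # the Hebrew word from the question

# Plain ASCII cannot represent these letters at all.
try:
    word.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)  # False

# The four bytes from the "ANSI" file in the question.
ansi_bytes = bytes([249, 236, 229, 237])

# Assumed Hebrew code page: the bytes come back as the original word.
as_hebrew = ansi_bytes.decode("cp1255")
print(as_hebrew == word)  # True

# Western European code page: same bytes, different letters.
as_western = ansi_bytes.decode("cp1252")
print(as_western)  # ùìåí
```

The file itself carries no marker saying which code page was used, so the reader has to guess or be told.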

1
votes

The Hebrew characters you have shown are Unicode code points U+05E9, U+05DC, U+05D5, and U+05DD. Those code points cannot fit in ASCII, which only covers U+0000 through U+007F. They would be saved to the file as the bytes 0xF9 0xEC 0xE5 0xED (respectively) if they are encoded with an 8-bit Hebrew charset such as Windows-1255 or ISO-8859-8 (Windows codepage 28598), both of which place the Hebrew letters at the same byte values. Such a file is displayed correctly only if it is interpreted using a matching charset. If you are not doing anything special to tell the OS to use that specific charset for that file, then your OS is most likely set to use Hebrew as its default language for non-Unicode programs, and that charset is its default for handling ANSI (not ASCII) data.
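To check which charsets produce those bytes, here is a minimal sketch using Python 3's built-in codecs. The codec names `cp1255`, `iso8859_8`, and `cp862` are Python's names for Windows-1255, ISO-8859-8, and the MS-DOS Hebrew code page mentioned in the comments; the dictionary name is made up for illustration.

```python
word = "\u05e9\u05dc\u05d5\u05dd"  # U+05E9, U+05DC, U+05D5, U+05DD

# Encode the same word with three Hebrew-capable 8-bit charsets.
encoded = {}
for codec in ("cp1255", "iso8859_8", "cp862"):
    encoded[codec] = word.encode(codec)
    print(codec, [f"0x{b:02X}" for b in encoded[codec]])

# Windows-1255 and ISO-8859-8 agree on the Hebrew letters.
print(encoded["cp1255"] == encoded["iso8859_8"])  # True

# The DOS code page 862 uses different byte values for the same letters.
print(encoded["cp862"] == encoded["cp1255"])  # False
```

This is consistent with `chcp` reporting 862 in a console while Notepad++ still writes 0xF9 0xEC 0xE5 0xED: the console code page and the code page used by GUI programs are separate settings.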