3
votes

I have some UTF-8 text in a file utf8.txt. The file contains some characters that are outside the ASCII range. I tried the following code:

var fname = "utf8.txt";
var enc = Encoding.GetEncoding("ISO-8859-1", EncoderFallback.ExceptionFallback,
    DecoderFallback.ExceptionFallback);
var s = System.IO.File.ReadAllText(fname, enc);

The expected behavior is that the code should throw an exception, since it is not valid ISO-8859-1 text. Instead, the behavior is that it correctly decodes the UTF-8 text into the right characters (it looks correct in the debugger).

Is this a bug in .Net?

EDIT:

The file I tested with originally was UTF-8 with BOM. If I remove the BOM, the behavior changes. It still does not throw an exception, however it produces an incorrect Unicode string (the string does not look correct in the debugger).

EDIT:

To produce my test file, run the following code:

var fname = "utf8.txt";
var utf8_bom_e_circumflex_bytes = new byte[] {0xEF, 0xBB, 0xBF, 0xC3, 0xAA};
System.IO.File.WriteAllBytes(fname, utf8_bom_e_circumflex_bytes);

EDIT:

I think I have a firm handle on what is going on (although I don't agree with part of .Net's behavior).

  • If the file starts with a UTF-8 BOM, and the data is valid UTF-8, then ReadAllText will completely ignore the encoding you passed in and (properly) decode the file as UTF-8. (I have not tested what happens if the BOM is a lie and the file is not really UTF-8.) I don't agree with this behavior. I think .Net should either throw an exception or use the encoding I gave it.

  • If the file has no BOM, .Net has no trivial (and 100% reliable) way to determine that the text is not really ISO-8859-1, since most (all?) UTF-8 text is also valid ISO-8859-1, although gibberish. So it just follows your instructions and decodes the file with the encoding you gave it. (I do agree with this behavior)
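To see both cases side by side, here is a small self-contained repro (the file names are arbitrary):

```csharp
using System;
using System.IO;
using System.Text;

class BomDemo
{
    static void Main()
    {
        var latin1 = Encoding.GetEncoding("ISO-8859-1");

        // "ê" encoded as UTF-8 (0xC3 0xAA), once with and once without the BOM.
        File.WriteAllBytes("with_bom.txt", new byte[] { 0xEF, 0xBB, 0xBF, 0xC3, 0xAA });
        File.WriteAllBytes("no_bom.txt", new byte[] { 0xC3, 0xAA });

        // BOM case: the encoding argument is ignored and the file is decoded as UTF-8.
        Console.WriteLine(File.ReadAllText("with_bom.txt", latin1)); // ê

        // No BOM: every byte is a valid ISO-8859-1 character, so we get
        // mojibake instead of an exception.
        Console.WriteLine(File.ReadAllText("no_bom.txt", latin1));   // Ãª
    }
}
```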

1
Can you provide a sample of the text in the file that you expect to be problematic? – DontThinkJustGo
I thought of that, but what is the best way to do that? I really would want to post a binary file. – JoelFan
Maybe just a couple of character encodings that you would expect to fail, and we can recreate the text based on that? Or maybe I can just go out and find some obscure UTF-8 text and use it. I assume it's not a specific character, just any invalid ISO-8859-1 character that you are concerned about. – DontThinkJustGo
ISO-8859-1 is an 8-bit encoding, so I believe characters from 0x00 to 0xFF are allowed. – DontThinkJustGo
"since most (all?) UTF-8 text is also valid ISO-8859-1" – that is only true for ASCII bytes 0x20-0x7E, which are identical in UTF-8 and ISO-8859-1. Once you get outside that range, UTF-8 is NOT valid ISO-8859-1. Bytes 0x00-0x1F and 0x7F-0x9F are not defined in ISO-8859-1 (0x00 is debatable, due to its common use as a null terminator), and most non-ASCII bytes in UTF-8 fall within the latter range due to the way UTF-8 encodes its bits. – Remy Lebeau

1 Answer

1
votes

should throw an exception, since it is not valid ISO-8859-1 text

In ISO-8859-1 all possible bytes have mappings to characters, so no exception will ever result from reading a non-ISO-8859-1 file as ISO-8859-1.

(True, all the bytes in the range 0x80–0x9F will become invisible control codes that you never want, but they're still valid, just useless. This is true of quite a few of the ISO-8859 encodings, which put the C1 control codes in the range 0x80–0x9F, but not all. You can certainly get an exception with other encodings that leave bytes unmapped, e.g. Windows-1252.)
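To illustrate the first point (a minimal sketch: the strict Latin-1 decoder accepts all 256 byte values, while a strict UTF-8 decoder, shown only for contrast, does throw on malformed input):

```csharp
using System;
using System.Text;

class Latin1Demo
{
    static void Main()
    {
        var strictLatin1 = Encoding.GetEncoding("ISO-8859-1",
            EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);

        // Every byte value decodes: 0x00-0xFF map 1:1 to U+0000-U+00FF.
        var allBytes = new byte[256];
        for (int i = 0; i < 256; i++) allBytes[i] = (byte)i;
        string s = strictLatin1.GetString(allBytes); // never throws
        Console.WriteLine(s.Length);                 // 256

        // Contrast: a strict UTF-8 decoder does throw on invalid input.
        var strictUtf8 = new UTF8Encoding(false, throwOnInvalidBytes: true);
        try
        {
            strictUtf8.GetString(new byte[] { 0xC3 }); // truncated 2-byte sequence
        }
        catch (DecoderFallbackException)
        {
            Console.WriteLine("UTF-8 decoder threw");
        }
    }
}
```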

If the file starts with UTF-8 BOM, and the data is valid UTF-8, then ReadAllText will completely ignore the encoding you passed in and (properly) decode the file as UTF-8.

Yep. This is hinted at in the doc:

This method attempts to automatically detect the encoding of a file based on the presence of byte order marks.

I agree with you that this behaviour is pretty stupid. I would prefer to read the raw bytes with ReadAllBytes and decode them explicitly with Encoding.GetString.
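A sketch of that approach (the helper name ReadAllTextStrict is mine, not a framework API; error handling kept deliberately simple):

```csharp
using System;
using System.IO;
using System.Text;

class StrictRead
{
    // Decodes a file with exactly the encoding you asked for:
    // no BOM sniffing, no silent override.
    static string ReadAllTextStrict(string path, Encoding enc)
    {
        byte[] bytes = File.ReadAllBytes(path);
        return enc.GetString(bytes); // throws DecoderFallbackException on unmapped input
    }

    static void Main()
    {
        File.WriteAllBytes("utf8.txt", new byte[] { 0xC3, 0xAA }); // "ê" in UTF-8, no BOM

        var strictLatin1 = Encoding.GetEncoding("ISO-8859-1",
            EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        // Latin-1 still can't throw (all bytes are valid), but at least
        // the BOM can no longer silently change the encoding.
        Console.WriteLine(ReadAllTextStrict("utf8.txt", strictLatin1)); // Ãª

        var strictUtf8 = new UTF8Encoding(false, true);
        Console.WriteLine(ReadAllTextStrict("utf8.txt", strictUtf8));   // ê
    }
}
```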