9
votes

This code,

OutputStream out = new FileOutputStream(new File("C:/file/test.txt"));
out.write("A".getBytes());

And this,

OutputStream out = new FileOutputStream(new File("C:/file/test.txt"));
out.write("A".getBytes(StandardCharsets.UTF_8));

produce the same result (in my opinion), which is UTF-8 without BOM. However, Notepad++ is not showing any information about the encoding. I expect Notepad++ to show "Encode in UTF-8 without BOM", but nothing is selected in the "Encoding" menu.

Now, this code writes the file in UTF-8 with a BOM.

 OutputStream out = new FileOutputStream(new File("C:/file/test.txt"));
 byte[] bom = { (byte) 239, (byte) 187, (byte) 191 };
 out.write(bom);
 out.write("A".getBytes()); 

For this file, Notepad++ does display the encoding as "Encode in UTF-8".

Question: What is wrong with the first two snippets, which are supposed to write the file in UTF-8 without BOM? Is my Java code doing the right thing? If so, is there a problem with how Notepad++ detects the encoding?

Is Notepad++ just guessing?

2
The letter A might be UTF-8, or ISO-646, or ISO-8859-1, or ISO-8859-2, or .... There's no way for notepad++ to guess that you are thinking UTF-8. - bmargulies
Why the downvote? Anything wrong? - Mawia
To the downvoters: does this question really deserve 2 downvotes? At least if you do downvote put a comment as to why. - prunge
@prunge: commenting on downvotes is desired, but not required. And that's by design. There's no need to request a comment, since those who downvoted already decided not to comment. - Joachim Sauer
If you don't specify the encoding (first example) the JVM will use the operating system default encoding (ANSI for Windows, UTF-8 for Linux). - Lluis Martinez

2 Answers

17
votes

"A" written using UTF-8 without a BOM produces exactly the same file as "A" written using ASCII or ISO-8859-* or any other ASCII-compatible encoding. That file contains a single byte with the decimal value 65.

Think of it this way:

  • "A".getBytes("UTF-8") returns a new byte[] { 65 }
  • "A".getBytes("ISO-8859-1") returns a new byte[] { 65 }
  • You write the results of those calls into a file
  • How is the consumer of the file supposed to distinguish the two?

There's nothing in that file that suggests that UTF-8 needs to be used to decode it.
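The point above can be checked directly. The following is a minimal sketch (the class name SameBytesDemo and the helper encodeAll are made up for illustration) that encodes "A" with three ASCII-compatible charsets and prints the resulting bytes:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SameBytesDemo {
    // Encodes the text with three ASCII-compatible charsets.
    // (Hypothetical helper, not part of the original answer.)
    static byte[][] encodeAll(String text) {
        return new byte[][] {
            text.getBytes(StandardCharsets.UTF_8),
            text.getBytes(StandardCharsets.ISO_8859_1),
            text.getBytes(StandardCharsets.US_ASCII)
        };
    }

    public static void main(String[] args) {
        for (byte[] encoded : encodeAll("A")) {
            // Prints [65] for every charset: the files would be byte-identical.
            System.out.println(Arrays.toString(encoded));
        }
    }
}
```

Since all three arrays are identical, a reader of the file has no evidence for preferring one of these encodings over the others.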

Try writing "Käsekuchen" or something else that's not encodable in ASCII and see if Notepad++ guesses the encoding correctly (because that's exactly what it does: it makes an educated guess, there's no metadata that tells it which encoding to use).
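To see why a non-ASCII string gives a guesser something to work with, here is a rough sketch. The class UmlautBytesDemo and the helper isValidUtf8 are illustrative names, and the validity check is only one plausible heuristic, not a description of what Notepad++ actually does internally:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class UmlautBytesDemo {
    // Returns true if the bytes decode as well-formed UTF-8.
    // (Hypothetical helper; an assumption about how a detector might reason.)
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "Käsekuchen", written with an escape so the source file's own
        // encoding does not affect the result.
        String text = "K\u00e4sekuchen";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(utf8.length);         // 11: the umlaut takes two bytes
        System.out.println(latin1.length);       // 10: the umlaut takes one byte
        System.out.println(isValidUtf8(utf8));   // true
        System.out.println(isValidUtf8(latin1)); // false: 0xE4 followed by 's' is malformed UTF-8
    }
}
```

Unlike the single "A", the two encodings now produce different byte sequences, so a tool has something to base its guess on.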

0
votes

I do not know if my answer is correct, but let me share my understanding.

As explained above, if you simply write "A", Notepad++ has no way to tell which encoding it is. But if you want Notepad++ to show "Encode in UTF-8 without BOM", as in the figure below:

enter image description here

Then you must fool Notepad++, which you can do using the following piece of code:

enter image description here

If you want Notepad++ to show "Encode in UTF-8", remove the substring part from osw.write("\uFEFF"), because \uFEFF is the BOM character you are inserting. With this character in the file, the encoding is shown as "Encode in UTF-8"; when you strip it programmatically, it is shown as "Encode in UTF-8 without BOM", because the BOM is no longer there.
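The code from the screenshot is not reproduced here, so the following is only a sketch of the underlying mechanism, not the answer's original code. It assumes osw is an OutputStreamWriter using UTF-8; the class BomWriterDemo and the method encode are made-up names. It shows that writing "\uFEFF" through such a writer yields the same three bytes (239, 187, 191) that the question writes by hand:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BomWriterDemo {
    // Encodes text as UTF-8 in memory, optionally preceded by a BOM.
    // (Hypothetical helper; a real program would wrap a FileOutputStream instead.)
    static byte[] encode(String text, boolean withBom) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer osw = new OutputStreamWriter(buffer, StandardCharsets.UTF_8)) {
            if (withBom) {
                osw.write("\uFEFF"); // U+FEFF encodes to EF BB BF in UTF-8
            }
            osw.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Java bytes are signed, so 239, 187, 191 print as -17, -69, -65.
        System.out.println(Arrays.toString(encode("A", true)));  // [-17, -69, -65, 65]
        System.out.println(Arrays.toString(encode("A", false))); // [65]
    }
}
```

Java's UTF-8 encoder never adds a BOM on its own, so the marker appears only if the program writes it explicitly.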

You also have to change the Notepad++ preferences as shown below. Only then will Notepad++ be able to recognize the encoding you want.

enter image description here

However, if you simply write the text, Notepad++ treats it as "ANSI".

I hope my explanation is clear and that my analysis helps someone. Note that this approach is a workaround and is not recommended, but it works when nothing else does.

If you do not want to change your Notepad++ preferences and still want the encoding shown as "Encode in UTF-8 without BOM", you must do something like this:

enter image description here

I have explained the same thing, probably in a better way, in my blog here