2
votes

The application I am developing will be used by folks in Western & Eastern Europe as well as in the US. I am encoding my input and decoding my output with the UTF-8 character set.

My confusion is that when I use the constructor String(byte[] bytes, String charsetName), I provide UTF-8 as the charsetName when it really is a character encoding. And my default encoding is set in Eclipse as Cp1252.

If, in the US in my Java application, I create an Output text file using Cp1252 as my charset encoding and UTF-8 as my charset name, will the folks in Europe be able to read this file in my Java application, and vice versa?

3
How would you "create an Output text file using Cp1252 as my charset encoding and UTF-8 as my charset name"? - Jon Skeet
@ Jon Skeet...so how do the files get encoded? I thought it uses the OS default character encoding....correct? - user547453
I can't answer that without seeing some code. I'd normally use FileOutputStream wrapped in an OutputStreamWriter, so it'll use whatever encoding I specify :) - Jon Skeet
@JonSkeet...even OutputStreamWriter uses the same form, OutputStreamWriter(OutputStream out, Charset cs), where the Charset is something like UTF-8/16. Where do we specify an encoding like 'Cp1252' or 'ISO-8859-1'? - user547453

3 Answers

11
votes

They're encodings. It's a pity that Java uses "charset" all over the place when it really means "encoding", but that's hard to fix now :( Annoyingly, IANA made the same mistake.

Actually, by Unicode terminology they're probably most accurately character encoding schemes:

A character encoding form plus byte serialization. There are seven character encoding schemes in Unicode: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.

Where a character encoding form is:

Mapping from a character set definition to the actual code units used to represent the data.

Yes, the fact that Unicode only defines seven character encoding schemes makes this even more confusing. Fundamentally, all most developers need to know is that a "charset" in Java terminology is a mapping between text data (String, char[]) and binary data (byte[]).
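To make that mapping concrete, here is a minimal sketch (not part of the original answer). The class name CharsetRoundTrip and its helper methods are made up for illustration; the only library calls are String.getBytes(Charset) and new String(byte[], Charset). ISO-8859-1 stands in for Cp1252 because it is one of the charsets every Java platform is required to support.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: a "charset" maps String <-> byte[].
public class CharsetRoundTrip {
    // Text -> bytes, using the given charset (encoding).
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Bytes -> text, using the given charset (decoding).
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // "cafe" with an acute accent on the e, written as an escape so the
        // encoding of this source file does not matter.
        String text = "caf\u00e9";

        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        byte[] latin1 = encode(text, StandardCharsets.ISO_8859_1);

        // Same text, different byte sequences.
        System.out.println(utf8.length);   // 5 (the accented e takes two bytes)
        System.out.println(latin1.length); // 4 (one byte per character)

        // Decoding with the charset used for encoding restores the text.
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(text));         // true
        System.out.println(decode(latin1, StandardCharsets.ISO_8859_1).equals(text));  // true
    }
}
```

The charset is chosen at each call, so the platform or IDE default never enters the picture as long as it is always passed explicitly.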

1
votes

I think those two things are not directly related.

The Eclipse setting decides how your Eclipse editor saves the text files (typically source code) you create or edit. You can use other editors, in which case the file may be saved in some other encoding scheme. As long as your Java compiler has no problem compiling your source code, you're safe.

The Java String(byte[] bytes, String charsetName) constructor is part of your own application logic: it determines how you interpret data you read from a file or the network. A different charsetName (essentially a different character encoding scheme) may interpret the same byte array differently.
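As a rough illustration of that last point (the class name DecodeMismatch and its helper are invented for this sketch), the same two bytes decode to one character under UTF-8 and to two different characters under ISO-8859-1:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: one byte array, two interpretations.
public class DecodeMismatch {
    // Interpret the bytes as text in the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // 0xC3 0xA9 is the UTF-8 encoding of U+00E9 (e with acute accent).
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9};

        String asUtf8 = decode(bytes, StandardCharsets.UTF_8);
        String asLatin1 = decode(bytes, StandardCharsets.ISO_8859_1);

        System.out.println(asUtf8.length());   // 1: the single character U+00E9
        System.out.println(asLatin1.length()); // 2: U+00C3 followed by U+00A9

        // The bytes are identical; only the interpretation differs.
        System.out.println(asUtf8.equals(asLatin1)); // false
    }
}
```

This is why the reader of a file has to use the same encoding the writer used, whatever either machine's default happens to be.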

1
votes

A "charset" does imply the set of characters that the text uses. For UTF-8/16, the character set happens to be "all" characters. For others, not necessarily. Back in the day, everybody was inventing their own character sets and encoding schemes, and the two mapped almost 1-to-1, so one name could be used to refer to both the character set and the encoding scheme.
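One way to see the "set of characters" side from Java is to ask a charset's encoder whether it can represent a given piece of text. This is only a sketch; the class name Repertoire and its helper are made up, while Charset.newEncoder() and CharsetEncoder.canEncode(CharSequence) are standard API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: which text fits into which charset?
public class Repertoire {
    // True if every character of the text can be encoded in the charset.
    static boolean canEncode(Charset charset, String text) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String euro = "\u20ac";   // euro sign
        String eAcute = "\u00e9"; // e with acute accent

        System.out.println(canEncode(StandardCharsets.UTF_8, euro));        // true
        System.out.println(canEncode(StandardCharsets.ISO_8859_1, euro));   // false: not in Latin-1
        System.out.println(canEncode(StandardCharsets.ISO_8859_1, eAcute)); // true
        System.out.println(canEncode(StandardCharsets.US_ASCII, eAcute));   // false: outside ASCII
    }
}
```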