Java - UTF8/16 is a Charset Name or Character Encoding?

Question

The application I am developing will be used by folks in Western & Eastern Europe as well as in the US. I am encoding my input and decoding my output with the UTF-8 character set.

My confusion arises because when I use the constructor String(byte[] bytes, String charsetName), I provide UTF-8 as the charsetName when it really is a character encoding. And my default encoding is set in Eclipse as Cp1252.

Does this mean that if, in the US in my Java application, I create an Output text file using Cp1252 as my charset encoding and UTF-8 as my charset name, the folks in Europe will be able to read this file in my Java application, and vice versa?

How would you "create an Output text file using Cp1252 as my charset encoding and UTF-8 as my charset name"? — Jon Skeet
@JonSkeet...so how do the files get encoded? I thought it uses the OS default character encoding....correct? — user547453
I can't answer that without seeing some code. I'd normally use FileOutputStream wrapped in an OutputStreamWriter, so it'll use whatever encoding I specify :) — Jon Skeet
@JonSkeet...even OutputStreamWriter uses the same form, OutputStreamWriter(OutputStream out, Charset cs), where the Charset is something like UTF-8/16. Where do we specify an encoding like 'Cp1252' or 'ISO-8859-1'? — user547453
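
To illustrate the point raised in the comments: the Charset argument of OutputStreamWriter is the one place where the file's encoding is chosen, and Cp1252 or ISO-8859-1 can be passed there just like UTF-8. The following is only a sketch under some assumptions: the class name WriterCharsetDemo and the helper roundTrip are made up for illustration, and it assumes the JVM provides the "windows-1252" charset (the canonical name for what Eclipse shows as Cp1252).

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class WriterCharsetDemo {

    // Hypothetical helper: writes text to a temp file using one charset,
    // then reads the raw bytes back and decodes them using another.
    public static String roundTrip(String text, Charset writeWith, Charset readWith) {
        try {
            Path file = Files.createTempFile("charset-demo", ".txt");
            try {
                // The charset passed here decides which bytes end up in the file.
                try (Writer out = new OutputStreamWriter(
                        new FileOutputStream(file.toFile()), writeWith)) {
                    out.write(text);
                }
                // The reader must decode with the same charset to recover the text.
                return new String(Files.readAllBytes(file), readWith);
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café", written with an escape to avoid source-encoding issues
        Charset cp1252 = Charset.forName("windows-1252"); // assumed to be available

        // Same charset on both sides: the text survives.
        System.out.println(roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8).equals(text));
        System.out.println(roundTrip(text, cp1252, cp1252).equals(text));

        // Written as Cp1252 but read as UTF-8: the accented character is lost.
        System.out.println(roundTrip(text, cp1252, StandardCharsets.UTF_8).equals(text));
    }
}
```

As long as the writer and the reader agree on the charset, it should not matter what the platform default is on either machine.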

Jon Skeet · Accepted Answer · 2013-03-11T20:51:33

They're encodings. It's a pity that Java uses "charset" all over the place when it really means "encoding", but that's hard to fix now :( Annoyingly, IANA made the same mistake.

Actually, by Unicode terminology they're probably most accurately character encoding schemes:

A character encoding form plus byte serialization. There are seven character encoding schemes in Unicode: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE.

Where a character encoding form is:

Mapping from a character set definition to the actual code units used to represent the data.

Yes, the fact that Unicode only defines seven character encoding schemes makes this even more confusing, since Java "charsets" such as Cp1252 and ISO-8859-1 are not among them. Fundamentally, all most developers need to know is that a "charset" in Java terminology is a mapping between text data (String, char[]) and binary data (byte[]).
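
As a rough illustration of "a charset is a mapping between text and bytes", the sketch below encodes the same String with two charsets and decodes the result. The class name CharsetDemo and its two helper methods are hypothetical, and "windows-1252" is assumed to be available on the JVM (it is the canonical name behind the Cp1252 alias).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetDemo {

    // Text -> binary: the charset decides which bytes represent the characters.
    public static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Binary -> text: the same charset must be used to get the characters back.
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // "café"
        Charset cp1252 = Charset.forName("windows-1252"); // assumed to be available

        byte[] asUtf8 = encode(text, StandardCharsets.UTF_8);
        byte[] asCp1252 = encode(text, cp1252);

        System.out.println(Arrays.toString(asUtf8));   // [99, 97, 102, -61, -87]
        System.out.println(Arrays.toString(asCp1252)); // [99, 97, 102, -23]

        // Decoding with the charset used for encoding restores the text.
        System.out.println(decode(asUtf8, StandardCharsets.UTF_8).equals(text)); // true

        // Decoding UTF-8 bytes as Cp1252 yields different text ("cafÃ©").
        System.out.println(decode(asUtf8, cp1252).equals(text)); // false
    }
}
```

The String itself has no "charset"; one only comes into play at the boundary where text is turned into bytes or back.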
