
I have been doing some coding with String in Java 8 and Java 11, but this question is based on Java 8. I have this little snippet:

final char e = (char) 200; // È

I thought that the characters between 0 and 255 [ASCII + extended ASCII] would always fit in a single byte, simply because 2^8 = 256, but this does not seem to be true. I tried the website https://mothereff.in/byte-counter and it states that the character takes 2 bytes. Can somebody please explain this to me?

Another question: a lot of posts state that Java is UTF-16, but on my machine running Windows 7 this snippet returns UTF-8:

String csn = Charset.defaultCharset().name();

Is this platform dependent?

I have also tried this snippet:


final List<Charset> charsets = Arrays.asList(StandardCharsets.ISO_8859_1, StandardCharsets.US_ASCII, StandardCharsets.UTF_16, StandardCharsets.UTF_8);
charsets.forEach(a -> print(a, "È"));
System.out.println("getBytes");
System.out.println(Arrays.toString("È".getBytes()));
charsets.forEach(a -> System.out.println(a + " " + Arrays.toString("È".getBytes(a))));

private void print(final Charset set, final CharSequence sb) {
    final byte[] array = new byte[4];
    set.newEncoder()
            .encode(CharBuffer.wrap(sb), ByteBuffer.wrap(array), true);
    final String decoded = new String(array, set);
    System.out.println(set + " " + Arrays.toString(array) + " " + decoded + "<<>>" + decoded.length());
}

And it prints:

run:
ISO-8859-1 [-56, 0, 0, 0] È // PERFECT: USES 1 BYTE, WHICH IS -56
US-ASCII [0, 0, 0, 0] // DON'T GET IT (SEE THIS ITEM LATER)
UTF-16 [-2, -1, 0, -56] È<<>>1 // WHAT ARE THE -2, -1 BYTES USED FOR? I HAVE TRIED OTHER EXAMPLES AND THEY ALWAYS APPEAR. AM I LOSING TWO BYTES HERE?
UTF-8 [-61, -120, 0, 0] 2 È // SEEMS MY CHARACTER NEEDS TWO BYTES? I THOUGHT CODE 200 WOULD REQUIRE ONLY ONE
getBytes
[-61, -120] // OK, MY UTF-8 REPRESENTATION
ISO-8859-1 [-56] // OK
US-ASCII [63] // OK, BUT WHY DOESN'T ANY BYTE GET ENCODED WHEN I ENCODE IN ASCII ABOVE?
UTF-16 [-2, -1, 0, -56] // AGAIN, WHAT ARE THE LEADING -2, -1 BYTES?
UTF-8 [-61, -120] // OK

I have also tried:

System.out.println(new String(new byte[]{-1, -2}, "UTF-16")); // SIMPLY "" ... AM I WASTING THESE 2 BYTES?

To summarize:

  1. Why does UTF-16 always have two leading bytes (new byte[]{-1, -2})? Are they wasted?

  2. Why do I get no bytes when I encode "È" with the ASCII charset via the encoder, but when I do "È".getBytes(StandardCharsets.US_ASCII) I get {63}?

  3. Java uses UTF-16, but in my case the default is UTF-8. Is this platform dependent?

Sorry if this post is confusing.

Environment

Windows 7 64-bit, NetBeans 8.2 with Java 1.8.0_121

2 Answers


Let's back up a bit…

Java's text datatypes use the UTF-16 character encoding of the Unicode character set. (As do VB4/5/6/A/Script, JavaScript, .NET, ….) You can see this in the various operations you do with the string API: indexing, length, ….

Libraries support converting between the text datatypes and byte arrays using various encodings. Some of them are categorized as "Extended ASCII", but stating that is a very poor substitute for naming the character encoding actually being used.

Some operating systems allow the user to designate a default character encoding. (Most users don't know or care, though.) Java attempts to pick this up. It is only useful when the program understands that input from the user is in that character encoding, or that output should be. This century, users dealing in text files prefer to use one specific encoding, communicate files unchanged across systems, and don't appreciate lossy conversions, so they have little use for this concept. From a program's point of view, the default is never what you want unless it is exactly what you want.
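As a minimal sketch of that point (the class name DefaultCharsetDemo is purely illustrative), compare the platform-dependent getBytes() overload with an explicitly named encoding:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DefaultCharsetDemo {
    public static void main(String[] args) {
        // Whatever the OS/JVM happens to default to: platform dependent
        System.out.println(Charset.defaultCharset());

        // Platform dependent: silently uses the default charset
        byte[] implicit = "È".getBytes();
        // Deterministic: the encoding is stated explicitly
        byte[] explicit = "È".getBytes(StandardCharsets.UTF_8);

        System.out.println(Arrays.toString(implicit));
        System.out.println(Arrays.toString(explicit)); // [-61, -120] on every platform
    }
}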

Where a conversion would be lossy, you have the choice of a replacement character (such as '?'), omitting the character, or throwing an exception.
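A hedged sketch of those three choices using the JDK's CharsetEncoder; CodingErrorAction.REPLACE, IGNORE and REPORT correspond to replacing, omitting, and throwing:

import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class LossyConversionDemo {
    public static void main(String[] args) {
        // REPORT is already the default for newEncoder(); REPLACE or IGNORE are the alternatives
        CharsetEncoder encoder = StandardCharsets.US_ASCII.newEncoder()
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            encoder.encode(CharBuffer.wrap("È"));
        } catch (CharacterCodingException e) {
            System.out.println("Cannot encode 'È' in US-ASCII: " + e);
        }
    }
}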

A character encoding is a map between a codepoint (an integer) of a character set and one or more code units, according to the definition of the encoding. A code unit has a fixed size, and the number of code units needed for a codepoint might vary by codepoint.

In libraries, it is not generally useful to have an array of code units, so they take the further step of converting to/from an array of bytes. Java byte values range from -128 to 127, but that is just Java's interpretation of them as two's-complement 8-bit integers. As the bytes are understood to be encoding text, their values are interpreted according to the rules of the character encoding.
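A tiny illustration of that interpretation, using the byte that keeps showing up in your output:

public class ByteInterpretationDemo {
    public static void main(String[] args) {
        byte b = (byte) 200;          // the ISO-8859-1 code for 'È'
        System.out.println(b);        // -56: the bits read as a signed two's-complement value
        System.out.println(b & 0xFF); // 200: the same bits read as an unsigned value
    }
}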

Because some Unicode encodings have code units more than one byte long, byte order becomes important. So, at the byte-array level, there is UTF-16 Big Endian and UTF-16 Little Endian. When communicating a text file or stream, you send the bytes as well as sharing knowledge of the encoding. This "metadata" is required for understanding. So, UTF-16BE or UTF-16LE, for example. To make that a bit easier, Unicode allows some metadata at the beginning of the file or stream to indicate the byte order. It is called the byte-order mark (BOM). So, the external metadata can share the encoding (say, UTF-16), while the internal metadata shares the byte order. Unicode allows the BOM to be present even when byte order is not relevant, such as in UTF-8. So, if the understanding is that the bytes are text encoded with some Unicode encoding and a BOM is present, then it's a very simple matter to figure out which Unicode encoding it is and what the byte order is, if relevant.
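A small sketch of the difference (class name BomDemo is illustrative): only the byte-order-agnostic "UTF-16" charset writes a BOM, the explicit-endianness variants do not:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BomDemo {
    public static void main(String[] args) {
        String s = "È";
        // "UTF-16" writes a big-endian BOM (0xFE 0xFF = -2, -1) and then big-endian code units
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16)));   // [-2, -1, 0, -56]
        // Explicit byte orders carry no BOM
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16BE))); // [0, -56]
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16LE))); // [-56, 0]
    }
}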

1) You are seeing the BOM in some of your Unicode encoding outputs.

2) È is not in the ASCII character set. What would you want to happen in this case? I often prefer an exception.

3) The system you were using, for your account, at the time of your tests, had UTF-8 as the default character encoding. Is that important to the way you want to, and have, encoded your text files on that system?


First question

For your first question: those bytes are the byte-order mark (BOM), and they specify the byte order (whether the least or most significant byte comes first) of a multibyte encoding such as UTF-16.
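So those two bytes are not wasted; they tell the decoder how to read the rest. A quick sketch, decoding your exact bytes and their little-endian mirror:

import java.nio.charset.StandardCharsets;

public class BomDecodeDemo {
    public static void main(String[] args) {
        // Big-endian BOM (FE FF) followed by 0x00C8 -> "È"
        System.out.println(new String(new byte[]{-2, -1, 0, -56}, StandardCharsets.UTF_16));
        // Little-endian BOM (FF FE) followed by the same code unit, low byte first -> also "È"
        System.out.println(new String(new byte[]{-1, -2, -56, 0}, StandardCharsets.UTF_16));
    }
}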

Second question

Every ASCII character can be encoded as a single byte in UTF-8. But ASCII is not an 8-bit encoding; it uses 7 bits per character. In fact, all Unicode characters with code points >= 128 require at least two bytes in UTF-8. (The reason is that you need a way to distinguish between the single-byte value 200 and a multibyte code point whose first byte happens to be 200. UTF-8 solves this by using bytes >= 128 only to represent multibyte code points.)
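You can see this in the bit patterns of the two UTF-8 bytes you got (a small sketch, illustrative class name):

import java.nio.charset.StandardCharsets;

public class Utf8TwoByteDemo {
    public static void main(String[] args) {
        byte[] utf8 = "È".getBytes(StandardCharsets.UTF_8); // code point U+00C8 = 200
        for (byte b : utf8) {
            // Prints 11000011 then 10001000 (0xC3 0x88, i.e. -61 and -120 as signed bytes)
            System.out.println(String.format("%8s", Integer.toBinaryString(b & 0xFF)).replace(' ', '0'));
        }
        // 110xxxxx starts a 2-byte sequence, 10xxxxxx is a continuation byte;
        // the payload bits 00011 + 001000 concatenate back to 11001000 = 200.
    }
}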

'È' is not an ASCII character, so it cannot be represented in ASCII. This explains the second output: 63 is ASCII for the character '?'. And indeed, the Javadoc for the getBytes(Charset) method specifies that unmappable input is mapped to "the default replacement byte array", in this case '?'. On the other hand, to obtain the first ASCII byte array you used the CharsetEncoder directly, which is a lower-level API and does not perform such automatic replacements. (If you had checked the result of the encode method, you would have found it returned a CoderResult instance representing an error.)
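A side-by-side sketch of the two behaviours (illustrative class name), showing the replacement byte from getBytes and the error result from the raw encoder:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ReplacementVsReportDemo {
    public static void main(String[] args) {
        // getBytes(Charset) silently substitutes the replacement byte, here '?' (63)
        System.out.println(Arrays.toString("È".getBytes(StandardCharsets.US_ASCII))); // [63]

        // The low-level encoder reports the problem instead of replacing it
        ByteBuffer out = ByteBuffer.allocate(4);
        CoderResult result = StandardCharsets.US_ASCII.newEncoder()
                .encode(CharBuffer.wrap("È"), out, true);
        System.out.println(result.isUnmappable());        // true: nothing was written to 'out'
        System.out.println(Arrays.toString(out.array())); // [0, 0, 0, 0]
    }
}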

Third question

Java 8 Strings use UTF-16 internally, but when communicating with other software, different encodings may be expected, such as UTF-8. The Charset.defaultCharset() method returns the default character set of the virtual machine, which depends on the locale and character set of the operating system, not on the encoding used internally by Java strings.
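A short illustration of that distinction (on Java 8, the default can also be overridden with -Dfile.encoding at JVM startup):

import java.nio.charset.Charset;

public class PlatformCharsetDemo {
    public static void main(String[] args) {
        // Picked up from the OS locale settings, not from how String stores text
        System.out.println(Charset.defaultCharset());
        System.out.println(System.getProperty("file.encoding"));
        // String objects are always sequences of UTF-16 code units, whatever this prints.
    }
}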