I been doing some coding with String in Java8,Java 11 but this question is based on Java 8. I have this little snippet.
final char e = (char)200;//È
I just thought that the characters between 0.255[Ascii+extended Ascii] would always fit in a byte just because 2^8=256 but this seems not to be true i have try on the website https://mothereff.in/byte-counter and states that the character is taking 2 bytes can somebody please explain to me.
Another question in a lot of post states that Java is UTF-16 but in my machine running Windows 7 is returning UTF-8 in this snippet.
String csn = Charset.defaultCharset().name();
Is this platform depent?
Other questions i have try this snippet.
final List<Charset>charsets = Arrays.asList(StandardCharsets.ISO_8859_1,StandardCharsets.US_ASCII,StandardCharsets.UTF_16,StandardCharsets.UTF_8);
charsets.forEach(a->print(a,"È"));
System.out.println("getBytes");
System.out.println(Arrays.toString("È".getBytes()));
charsets.forEach(a->System.out.println(a+" "+Arrays.toString(sb.toString().getBytes(a))));
private void print(final Charset set,final CharSequence sb){
byte[] array = new byte[4];
set.newEncoder()
.encode(CharBuffer.wrap(sb), ByteBuffer.wrap(array), true);
final String buildedString = new String(array,set);
System.out.println(set+" "+Arrays.toString(array)+" "+buildedString+"<<>>"+buildedString.length());
}
And prints
run:
ISO-8859-1 [-56, 0, 0, 0] È//PERFECT USING 1 BYTE WHICH IS -56
US-ASCII [0, 0, 0, 0] //DONT GET IT SEE THIS ITEM FOR LATER
UTF-16 [-2, -1, 0, -56] È<<>>1 //WHAT IS -2,-1 BYTE USED FOR? I HAVE TRY WITH OTHER EXAMPLES AND THEY ALWAYS APPEAR AM I LOSING TWO BYTES HERE??
UTF-8 [-61, -120, 0, 0] 2 È //SEEMS TO MY CHARACTER NEEDS TWO BYTES?? I THOUGHT THAT CODE=200 WOULD REQUIRE ONLY ONE
getBytes
[-61, -120]//OK MY UTF-8 REPRESENTATION
ISO-8859-1 [-56]//OK
US-ASCII [63]//OK BUT WHY WHEN I ENCODE IN ASCCI DOESNT GET ANY BYTE ENCODED?
UTF-16 [-2, -1, 0, -56]//AGAIN WHAT ARE -2,-1 IN THE LEADING BYTES?
UTF-8 [-61, -120]//OK
I have try
System.out.println(new String(new byte[]{-1,-2},"UTF-16"));//SIMPLE "" I AM WASTING THIS 2 BYTES??
In resume.
Why UTF-16 always has two leading bytes are they wasted? new byte[]{-1,-2}
Why when i encode "È" i dont get any bytes in ASCCI Charset but when i do È.getBytes(StandardCharsets.US_ASCII) i get {63}?
Java uses UTF-16 but in my case UTF-8 is platform depend??
Sorry if this post is confussing
Environment
Windows 7 64 Bits Netbeans 8.2 with Java 1.8.0_121