Why do I have to encode a utf-8 parameter String to iso-Latin and then decode as utf-8 to get Java utf-8 String?

Question

I have a Java servlet that takes a parameter String (inputString) that may contain Greek letters from a web page marked up as utf-8. Before I send it to a database I have to convert it to a new String (utf8String) as follows:

String utf8String = new String(inputString.getBytes("8859_1"), "UTF-8");

This works, but, as I hope will be appreciated, I hate doing something I don't understand, even if it works.

According to the Javadoc, the getBytes() method "Encodes this String into a sequence of bytes using the named charset, storing the result into a new byte array", i.e. I am encoding it in 8859_1 (ISO Latin-1). And according to the constructor description, it "Constructs a new String by decoding the specified array of bytes using the specified charset", i.e. it decodes the byte array as UTF-8.

Can someone explain to me why this is necessary?

If you are hand-coding the Ajax call, what character encoding are you using on the call, i.e. on the POST method you're sending to the server? Can you capture the HTTP request and show it? — Andreas
I'm using a GET request and looking at my js I don't see the character encoding of the request specified. I am using the javascript encodeURIComponent() method to encode the string. I can't find the request — am using Mac Safari and must be looking in the wrong place in the Developer console. I just wonder if I've jumped the gun. I'm trying to write up all the utf-8 encoding stuff I've used over the years in a dozen or so Java apps. I should have checked this operation was really necessary before posting, so I could be sure how to break it. Give me time to do that please. — David
OK, I have another simpler servlet that has javascript create a faux form to send a string from a web input to the Servlet. The HTTP request includes the Greek character as it's a get request I can see in my URL field: localhost:8080/MidgutAtlas/…. But I don't think you can specify the charset on a HTTP request, only on the response. And, yes the line of code is necessary. — David
α is the greek small letter alpha, but if you had called encodeURIComponent() on the value α-Est4, the α should have been escaped as %CE%B1 (UTF-8 hex). — Andreas
I agree. Safari is probably being clever and rendering this for the user. On the assumption that you can't specify the charset of a request because originally it was just a URL and ISO-Latin-1 was assumed, irrespective of arguments to web apps, I changed 8859_1 to UTF-8, on the assumption that all Iso-Latin-1 characters are the same in UTF-8 and any multibyte characters will be obvious. However this gave me a response No results found for ‘Î±-Est4’. These are the individual values of %CE and %B1. Perhaps the problem is URL escapes are not unicode and hence you have to get the bytes first. — David

David David · Accepted Answer · 2016-03-23T12:18:07

My question is based on a misconception regarding the character set used for the HTTP request. I had assumed that because I marked up the web page from which the request was sent as UTF-8 the request would be sent as UTF-8, and so the Greek characters in the parameter sent to the servlet would be read as a UTF-8 String (‘inputString’ in my line of code) by the HttpServletRequest.getParameter() method. This is not the case.

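A URI, and therefore the query string of a GET request, may contain only ASCII characters, so anything else has to be percent-encoded, and the servlet container then has to choose a character set for the escaped bytes. By default it used ISO-8859-1 here, which is also the default for a POST body whose Content-Type names no charset; ASCII is a subset of ISO-8859-1, so the two agree for plain ASCII text. The ASCII restriction is part of the URI Syntax specification; thanks to Andreas for pointing me to http://wiki.apache.org/tomcat/FAQ/CharacterEncoding where this is explained.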

I had also forgotten that Greek letters such as α are URL-encoded for the request, so α is sent as %CE%B1, the percent-escaped form of its two UTF-8 bytes. getParameter() decodes each escaped byte as a separate ISO-8859-1 character, so %CE and %B1 become Î and ± (I checked this).
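The effect can be reproduced outside a servlet. The following is only a sketch: it uses java.net.URLEncoder and java.net.URLDecoder as stand-ins for the browser's encodeURIComponent() and for the container's parameter decoding, and the class and method names are made up for illustration. Non-ASCII characters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

public class PercentDecodeDemo {

    // Percent-encodes text via its UTF-8 bytes. For this particular input the
    // result matches what encodeURIComponent() produces in the browser.
    static String escapeUtf8(String text) {
        try {
            return URLEncoder.encode(text, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Decodes percent-escapes, interpreting the escaped bytes in the named charset.
    static String unescape(String escaped, String charsetName) {
        try {
            return URLDecoder.decode(escaped, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String alphaEst4 = "\u03B1-Est4"; // Greek small alpha, then "-Est4"
        String escaped = escapeUtf8(alphaEst4);
        System.out.println(escaped); // %CE%B1-Est4

        // Same escaped bytes, two different interpretations.
        String asLatin1 = unescape(escaped, "ISO-8859-1"); // two Latin-1 chars, then "-Est4"
        String asUtf8 = unescape(escaped, "UTF-8");        // one alpha, then "-Est4"
        System.out.println(asLatin1.equals("\u00CE\u00B1-Est4")); // true
        System.out.println(asUtf8.equals(alphaEst4));             // true
    }
}
```

The two decode calls receive identical input; only the charset used to interpret the escaped bytes differs, which is the choice the container makes on the servlet's behalf.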

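I now understand why this needs to be turned into a byte array and the bytes interpreted as UTF-8. Encoding the String as ISO-8859-1 gives back the original bytes 0xCE and 0xB1. 0xCE does not represent a one-byte character in UTF-8; it is the lead byte of a two-byte sequence, and hence it is combined with the next byte, 0xB1, to be interpreted as α. (Î is 0xC3 0x8E and ± is 0xC2 0xB1 in UTF-8, which is why calling getBytes("UTF-8") instead just reproduces Î±.)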
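The byte values can be checked with a few lines of plain Java. This is a minimal sketch rather than servlet code: the class and helper names are hypothetical, and it uses the StandardCharsets constants in place of the "8859_1" and "UTF-8" charset names from the original line.

```java
import java.nio.charset.StandardCharsets;

public class Latin1RoundTrip {

    // Recovers text whose UTF-8 bytes were wrongly decoded as ISO-8859-1.
    // ISO-8859-1 maps each char in U+0000..U+00FF to the single byte of the
    // same value, so getBytes() returns exactly the bytes that were received.
    static String repair(String misdecoded) {
        byte[] wireBytes = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(wireBytes, StandardCharsets.UTF_8);
    }

    // Formats bytes as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    static String latin1Hex(String s) {
        return hex(s.getBytes(StandardCharsets.ISO_8859_1));
    }

    static String utf8Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // The two characters getParameter() returned: I-circumflex and plus-minus.
        String misdecoded = "\u00CE\u00B1";
        System.out.println(latin1Hex(misdecoded)); // CE B1
        System.out.println(utf8Hex(misdecoded));   // C3 8E C2 B1
        System.out.println(repair(misdecoded).equals("\u03B1")); // true
    }
}
```

The second hex line shows why swapping 8859_1 for UTF-8 in getBytes() does not help: the two misread characters are each encoded as their own two-byte sequence, and decoding those four bytes as UTF-8 simply returns the same two characters.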
