I'm facing a charset (UTF-8) problem in a Java file hosted on Heroku.
It is easiest to explain with a small example:
// '…' UTF-8 encoding is 0xE2 0x80 0xA6
// stringToHex() outputs the HEX value to console/log
stringToHex(new String("…".getBytes(), "UTF-8"));
Locally (Tomcat 7) everything works perfectly: "0xE2 0x80 0xA6" is output in the console.
When I try it on the staging server, hosted on Heroku (Jetty 7), "0xEF 0xBF 0xBD 0xEF 0xBF 0xBD 0xEF 0xBF 0xBD" is written to the log instead.
Both servers run Java with the parameter "-Dfile.encoding=UTF-8" (so Charset.defaultCharset().toString() outputs "UTF-8" on both).
Can anyone help me solve this bizarre problem?
Thanks.
Update - forgot to say: all the files are encoded in UTF-8 and compiled using javac -encoding UTF-8
Update 2 - tried with '£' instead of '…' and on the staging server I get "0xEF 0xBF 0xBD 0xEF 0xBF 0xBD" instead of "0xC2 0xA3". It seems every single byte is being converted to "0xEF 0xBF 0xBD" (the UTF-8 encoding of U+FFFD, the replacement character �).
Update 3 - since Heroku is using Jetty, I tried using Jetty locally and everything is working perfectly.
Update 4 - here is my stringToHex() function:
private void stringToHex(String string) throws UnsupportedEncodingException {
    String result = "";
    String tmp;
    for (byte b : string.getBytes("UTF-8")) {
        tmp = Integer.toHexString(0xFF & b);
        if (tmp.length() == 1) {
            // Pad on the left so that 0x0A prints as "0A", not "A0"
            tmp = "0" + tmp;
        }
        result += "0x" + tmp.toUpperCase() + " ";
    }
    logger.info(result);
}
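For comparison, here is a minimal standalone sketch of the same hex dump that can be run outside the web app. It assumes Java 7 or later (for StandardCharsets); the class name HexDump and the method name toHex are hypothetical, and it returns the string instead of logging it. The literals are written as Unicode escapes so the result does not depend on how the source file is read by the compiler.

```java
import java.nio.charset.StandardCharsets;

public class HexDump {
    // Returns the UTF-8 bytes of the input as space-separated "0xNN" groups.
    static String toHex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            // %02X zero-pads on the left, so the byte 0x0A prints as "0A"
            sb.append(String.format("0x%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // Unicode escapes keep the literals independent of the source encoding
        System.out.println(toHex("\u2026")); // 0xE2 0x80 0xA6
        System.out.println(toHex("\u00A3")); // 0xC2 0xA3
    }
}
```

If this prints the expected bytes on the staging server while the literal '…' version does not, the difference is likely introduced before the string reaches the method rather than inside it.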
To compile in UTF-8 I use the maven-compiler-plugin. pom.xml relevant part:
<plugins>
    ...
    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.3.2</version>
        <configuration>
            <encoding>UTF-8</encoding>
        </configuration>
    </plugin>
    ...
</plugins>
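The pattern from Update 2 (one "0xEF 0xBF 0xBD" triple per original byte) can be reproduced in isolation. The sketch below shows one mechanism that produces it: UTF-8 bytes decoded with a charset that cannot represent them, here US-ASCII. This is only an illustration of how such output can arise, not a confirmed diagnosis of what happens on Heroku; the class and method names are hypothetical, and Java 7 or later is assumed.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementDemo {
    // Formats bytes as space-separated "0xNN" groups.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("0x%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    // Encodes as UTF-8, decodes the bytes as US-ASCII, then re-encodes as UTF-8.
    static String roundTripThroughAscii(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        // Every byte >= 0x80 is invalid in US-ASCII and becomes U+FFFD
        String damaged = new String(utf8, StandardCharsets.US_ASCII);
        return toHex(damaged.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Three UTF-8 bytes in, three replacement characters out
        System.out.println(roundTripThroughAscii("\u2026"));
        // Two UTF-8 bytes in, two replacement characters out
        System.out.println(roundTripThroughAscii("\u00A3"));
    }
}
```

Since the damage is done once the bytes have been decoded, no later getBytes("UTF-8") call can recover the original character.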
[...] -Dfile.encoding=UTF-8 system property and not all encoding libraries use it - even in the Sun libraries. See bug 4163515. Providing the encoding explicitly in transcoding operations is the only safe way to write portable code. – McDowell
[...] -Dfile.encoding=UTF-8, on both servers Charset.defaultCharset().toString() returns "UTF-8". How do you provide the encoding explicitly? Thanks – satoshi
[...] String.getBytes(Charset) or String.getBytes(String) methods to create your byte array. – McDowell
stringToHex("£"), stringToHex(new String("£".getBytes(), "UTF-8")) and stringToHex(new String("£".getBytes("UTF-8"), "UTF-8")) all give the same wrong output. – satoshi
[...] "\u2026",) it sounds like there's a defect in the stringToHex method. What encoding does it convert to? If it was outputting the String value in native Java UTF-16BE form, it would be 0x20 0x26. – McDowell
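The suggestions in the comments (pass the charset explicitly, and try the "\u2026" escape) can be combined into one small check. The sketch below is an assumption-laden illustration, not code from the thread: the class and helper names are hypothetical, and StandardCharsets requires Java 7 or later. It prints the UTF-8 and UTF-16BE bytes of the ellipsis so the two forms mentioned above can be compared directly.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    // Formats bytes as space-separated "0xNN" groups.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("0x%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    // String.getBytes(Charset): no dependence on the platform default charset.
    static String utf8Hex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_8));
    }

    static String utf16beHex(String s) {
        return hex(s.getBytes(StandardCharsets.UTF_16BE));
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // The escape avoids any dependence on the source file encoding
        String ellipsis = "\u2026";
        System.out.println(utf8Hex(ellipsis));    // 0xE2 0x80 0xA6
        System.out.println(utf16beHex(ellipsis)); // 0x20 0x26
        // String.getBytes(String) yields the same bytes but declares a checked exception
        System.out.println(hex(ellipsis.getBytes("UTF-8")));
    }
}
```

If the escaped literal gives the correct bytes on the staging server but the literal '…' does not, that points at how the source file is read at compile time rather than at the runtime default charset.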