Heroku replacing UTF-8 bytes with � (0xEF 0xBF 0xBD)

Question

I'm facing a charset problem (UTF-8) in a Java file hosted on Heroku.

A small example explains it best:

// '…' UTF-8 encoding is 0xE2 0x80 0xA6
// stringToHex() outputs the HEX value to console/log
stringToHex(new String("…".getBytes(), "UTF-8"));

Everything works perfectly locally (Tomcat 7): "0xE2 0x80 0xA6" is output in the console.

When I try it on the staging server, hosted on Heroku (Jetty 7), "0xEF 0xBF 0xBD 0xEF 0xBF 0xBD 0xEF 0xBF 0xBD" is written to the log instead.

Both servers run Java with the parameter "-Dfile.encoding=UTF-8" (so Charset.defaultCharset().toString() outputs "UTF-8" on both).

Can anyone help me solve this bizarre problem?

Thanks.

Update - forgot to say: all the files are encoded in UTF-8 and compiled using javac -encoding UTF-8

Update 2 - tried with '£' instead of '…' and on the staging server I get "0xEF 0xBF 0xBD 0xEF 0xBF 0xBD" instead of "0xC2 0xA3"... Seems like it's always converting every single byte to "0xEF 0xBF 0xBD" (which corresponds to �)... ???

Update 3 - since Heroku is using Jetty, I tried using Jetty locally and everything is working perfectly.

Update 4 - here is my stringToHex() function:

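private void stringToHex(String string) throws UnsupportedEncodingException {
    StringBuilder result = new StringBuilder();
    for (byte b : string.getBytes("UTF-8")) {
        // Mask to 0..255 so negative bytes do not print as "ffffffe2".
        String tmp = Integer.toHexString(0xFF & b);
        if (tmp.length() == 1) {
            // Pad on the left: 0x0A must print as "0A", not "A0".
            tmp = "0" + tmp;
        }

        result.append("0x").append(tmp.toUpperCase()).append(' ');
    }

    logger.info(result.toString());
}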

To compile in UTF-8 I use the maven-compiler-plugin. The relevant part of pom.xml:

<plugins>
    ...
    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>2.3.2</version>
        <configuration>
            <encoding>UTF-8</encoding>
        </configuration>
    </plugin>
    ...
</plugins>

Are you using the same JVM on both hosts? There has never been a requirement for JVMs to support the -Dfile.encoding=UTF-8 system property and not all encoding libraries use it - even in the Sun libraries. See bug 4163515. Providing the encoding explicitly in transcoding operations is the only safe way to write portable code. — McDowell
Thank you for your answer, @McDowell. Locally I'm running 1.7.0_02, on Heroku 1.6.0_20. Even if one of the JVMs doesn't support -Dfile.encoding=UTF-8, Charset.defaultCharset().toString() returns "UTF-8" on both servers. How do you provide the encoding explicitly? Thanks — satoshi
use the String.getBytes(Charset) or String.getBytes(String) methods to create your byte array. — McDowell
I tried it already, stringToHex("£"), stringToHex(new String("£".getBytes(), "UTF-8")) and stringToHex(new String("£".getBytes("UTF-8"), "UTF-8")) all give the same wrong output. — satoshi
Assuming your compilation step is OK (try the escaped string "\u2026"), it sounds like there's a defect in the stringToHex method. What encoding does it convert to? If it were outputting the String value in native Java UTF-16BE form, it would be 0x20 0x26. — McDowell
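
The escaped-string suggestion above can be turned into a quick check of the compilation step. This is only a sketch: the class and method names are hypothetical and not part of the original code. If the compiler read the source file with the wrong charset, the literal and the escape will differ, and the code points of the literal show what the compiler actually stored.

```java
class LiteralCheck {
    // Lists the code points of a string, e.g. "U+2026", independent of any byte encoding.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String literal = "…";      // depends on the charset the compiler used to read this file
        String escaped = "\u2026"; // pure ASCII in the source, so immune to the source charset

        // false here means the literal was already damaged at compile time
        System.out.println("literal equals escape: " + literal.equals(escaped));
        System.out.println("literal: " + codePoints(literal));
        System.out.println("escaped: " + codePoints(escaped));
    }
}
```

If the two lines differ only on the build that misbehaves, the runtime charset settings are not the culprit and the build configuration is the place to look.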

satoshi · Accepted Answer · 2012-03-25T11:28:11

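The problem was caused by the AspectJ configuration. If you use AspectJ with Java and Spring, you have to specify the encoding in the aspectj-maven-plugin configuration as well (the <encoding> element near the end of the block); setting it only on the maven-compiler-plugin is not enough: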

<plugins>
    ...
    <plugin>
        <groupId>org.codehaus.mojo</groupId>
        <artifactId>aspectj-maven-plugin</artifactId>
        <version>1.0</version>
        <dependencies>
            <dependency>
                <groupId>org.aspectj</groupId>
                <artifactId>aspectjrt</artifactId>
                <version>1.6.10</version>
            </dependency>
            <dependency>
                <groupId>org.aspectj</groupId>
                <artifactId>aspectjtools</artifactId>
                <version>1.6.10</version>
            </dependency>
        </dependencies>
        <executions>
            <execution>
                <goals>
                    <goal>compile</goal>
                    <goal>test-compile</goal>
                </goals>
            </execution>
        </executions>
        <configuration>
            <outxml>true</outxml>
            <verbose>true</verbose>
            <showWeaveInfo>true</showWeaveInfo>
            <aspectLibraries>
                <aspectLibrary>
                    <groupId>org.springframework</groupId>
                    <artifactId>spring-aspects</artifactId>
                </aspectLibrary>
            </aspectLibraries>
            <source>1.6</source>
            <target>1.6</target>
            <encoding>UTF-8</encoding>
        </configuration>
    </plugin>
    ...
</plugins>
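
The symptom in the question, where every source byte turns into 0xEF 0xBF 0xBD, is what Java produces when UTF-8 bytes are decoded with a charset that cannot represent them (US-ASCII here) and the result is then encoded as UTF-8 again. The sketch below reproduces that at runtime. That the AspectJ compiler on the build host read the sources with such a default charset is an assumption, not something stated above. The class and method names are hypothetical, and StandardCharsets requires Java 7 or later.

```java
import java.nio.charset.StandardCharsets;

class ReplacementDemo {
    // Formats bytes like the log output in the question, e.g. "0xE2 0x80 0xA6".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Decodes the UTF-8 bytes of s with the wrong charset (US-ASCII, an assumption),
    // then re-encodes the damaged string as UTF-8.
    static String misdecodedHex(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        String wrong = new String(utf8, StandardCharsets.US_ASCII); // each non-ASCII byte becomes U+FFFD
        return hex(wrong.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        // Correct round trip: 0xE2 0x80 0xA6
        System.out.println(hex("\u2026".getBytes(StandardCharsets.UTF_8)));
        // Wrong decoding: 0xEF 0xBF 0xBD 0xEF 0xBF 0xBD 0xEF 0xBF 0xBD
        System.out.println(misdecodedHex("\u2026"));
        // Two source bytes for the pound sign give two replacement characters
        System.out.println(misdecodedHex("\u00A3"));
    }
}
```

Once the literal has been damaged this way, no explicit charset passed to getBytes() or new String() at runtime can recover it, which matches the comment that all three stringToHex variants gave the same wrong output.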
