15
votes

I'm looking for an explanation of how App Engine handles character encodings. I'm working on a client-server application where the server runs on App Engine.

This is a new application built from scratch, so we're using UTF-8 everywhere. The client sends some strings to the server through POST, x-www-form-urlencoded. I receive them and echo them back. When the client gets it back, it's ISO-8859-1! I also see this behavior when POSTing to the blobstore, with the parameters sent as UTF-8, multipart/form-data encoded.

For the record, I'm seeing this in Wireshark. So I'm 100% sure I send UTF-8 and receive ISO-8859-1. Also, I'm not seeing mojibake: the ISO-8859-1 encoded strings are perfectly fine. This is also not an issue of misinterpreting the Content-Type. It's not the client. Something along the way is correctly recognizing I'm sending UTF-8 parameters, but is converting them to ISO-8859-1 for some reason.

I'm led to believe ISO-8859-1 is the default character encoding for the GAE servlets. My question is, is there a way to tell GAE not to convert to ISO-8859-1 and instead use UTF-8 everywhere?

Let's say the servlet does something like this:

public void doPost(HttpServletRequest req, HttpServletResponse resp) throws IOException {
    resp.setContentType("application/json");
    String name = req.getParameter("name");
    String json = "{\"name\":\"" + name + "\"}";
    resp.getOutputStream().print(json);
}

I tried setting the character encoding of the response and request to "UTF-8", but that didn't change anything.

Thanks in advance,

4
I don't know about GAE, but your API looks like J2EE Servlets. There ISO-8859-1 is indeed the default. Use resp.setCharacterEncoding() to change the encoding or print binaries directly. - ZeissS
GAE uses that same API. I tried setting the character encoding in the response already, it doesn't work. :( Thanks, though. - fnf
I haven't used appengine at all, but with all appservers I had to implement a filter to force the encoding to be UTF-8 (because of the 'issue' Zeiss mentioned...stupid servlet spec). You can copy paste the filter from tomcat if you don't want to reinvent the wheel. - Augusto
@Augusto: No, I didn't try that. The links I've seen use Spring, I'm not using Spring. Your link isn't loading here. I found that code elsewhere, I'll give it a shot and get back to you, thanks. - fnf

4 Answers

17
votes

I see two things you should do.

1) Set the encoding-related system properties (if you use them) to UTF-8 in your appengine-web.xml:

<system-properties>
    <property name="java.util.logging.config.file" value="WEB-INF/logging.properties" />
    <property name="file.encoding" value="UTF-8" />
    <property name="DEFAULT_ENCODING" value="UTF-8" />
</system-properties>

That is what I have, but the docs suggest this instead:

<env-variables>
    <env-var name="DEFAULT_ENCODING" value="UTF-8" />
</env-variables>

https://developers.google.com/appengine/docs/java/config/appconfig

2) Specify the encoding when you set the content type, or it will revert to the default.

The content type may include the type of character encoding used, for example, text/html; charset=ISO-8859-4.

I'd try

resp.setContentType("application/json; charset=UTF-8");

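You could also try resp.getWriter(), which writes character data using the response's character encoding. Set the content type (with its charset) before calling getWriter(), because the encoding cannot be changed afterwards.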

http://docs.oracle.com/javaee/1.3/api/javax/servlet/ServletResponse.html#getWriter%28%29
http://docs.oracle.com/javaee/1.3/api/javax/servlet/ServletResponse.html#setContentType(java.lang.String)

For what it's worth, I need UTF-8 for Japanese content and I have no trouble. I'm not using a filter or setContentType anyway. I am using GWT and #1 above and it works.
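
To illustrate point 2: a writer encodes characters with whatever charset it is bound to, so the charset decides which bytes go on the wire. The sketch below imitates that with a plain OutputStreamWriter over a byte buffer, without any servlet container. The class and method names (CharsetBytesDemo, encodeWith) are made up for this illustration and are not part of the servlet API.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetBytesDemo {

    // Encodes text the way a Writer bound to the given charset would,
    // and returns the raw bytes that would be sent to the client.
    static byte[] encodeWith(String text, Charset charset) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(buffer, charset)) {
            writer.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory buffer; rethrown unchecked for brevity.
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        String json = "{\"name\":\"caf\u00e9\"}"; // "cafe" ending in e-acute (U+00E9)
        // The same 15 characters produce a different number of bytes per charset.
        System.out.println("UTF-8 bytes:      " + encodeWith(json, StandardCharsets.UTF_8).length);
        System.out.println("ISO-8859-1 bytes: " + encodeWith(json, StandardCharsets.ISO_8859_1).length);
    }
}
```

In UTF-8 the e-acute is sent as the two bytes C3 A9, in ISO-8859-1 as the single byte E9, which is the difference that shows up in a packet capture.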

7
votes

Found a way to work around it. This is how I did it:

  • Used "application/json; charset=UTF-8" as the content-type. Alternatively, set the response charset to "UTF-8" (either will work fine, no need to do both).

  • Base64-encoded the input strings that aren't ASCII-safe and come as UTF-8. Otherwise they get converted to ISO-8859-1 when they get to the servlet, apparently.

  • Used resp.getWriter() instead of resp.getOutputStream() to print the JSON response.

After all those conditions were met, I was finally able to output UTF-8 back to the client.
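
The Base64 step from the second bullet could look roughly like this, using java.util.Base64 (Java 8 and later, which may not be what the original workaround ran on). The helper names toWire and fromWire are hypothetical: the client would call the first and the servlet the second. Using the URL-safe alphabet is an assumption made here because it avoids '+', which form encoding would otherwise treat as a space.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

class Base64Param {

    // Client side: take the UTF-8 bytes of the text and Base64-encode them,
    // so only ASCII characters travel in the request parameter.
    static String toWire(String text) {
        return Base64.getUrlEncoder().encodeToString(text.getBytes(StandardCharsets.UTF_8));
    }

    // Server side: decode the ASCII value and interpret the bytes as UTF-8.
    static String fromWire(String wire) {
        return new String(Base64.getUrlDecoder().decode(wire), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "cafe" ending in e-acute (U+00E9)
        String wire = toWire(original);
        System.out.println("on the wire: " + wire);
        System.out.println("round trip ok: " + fromWire(wire).equals(original));
    }
}
```

Because the parameter value is pure ASCII, it reads the same whichever charset the container uses to decode request parameters.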

1
votes

This is not specific to GAE, but in case you find it useful: I made my own filter:

In web.xml

<filter>
    <filter-name>charsetencoding</filter-name>
    <filter-class>mypackage.CharsetEncodingFilter</filter-class>
</filter>
    ...
<filter-mapping>
   <filter-name>charsetencoding</filter-name>
   <url-pattern>/*</url-pattern> 
</filter-mapping>

Place the filter-mapping fragment near the beginning of your filter-mappings, and check your url-pattern.

And the filter class:

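import java.io.IOException;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

public class CharsetEncodingFilter implements Filter {

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws IOException, ServletException {
        // Set both encodings before the request is handled: the request encoding must be set
        // before any parameter is read, and the response encoding has no effect once
        // getWriter() has been called or the response has been committed.
        request.setCharacterEncoding("UTF-8");
        response.setCharacterEncoding("UTF-8");
        chain.doFilter(request, response);
    }

    public void destroy() { }

    public void init(FilterConfig filterConfig) throws ServletException { }
}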
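
Why the request encoding matters for x-www-form-urlencoded bodies can be shown without a servlet container: the same percent-encoded bytes decode to different strings depending on the charset. This sketch uses java.net.URLDecoder as a stand-in for the container's parameter parsing, which is an assumption made for illustration only; the class name FormDecodingDemo is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class FormDecodingDemo {

    // Decodes a form-encoded value with the named charset, standing in for what
    // a container might do after request.setCharacterEncoding(charsetName).
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // "cafe" ending in e-acute, percent-encoded from its UTF-8 bytes (C3 A9).
        String sent = "caf%C3%A9";
        // Decoded as UTF-8, the two bytes form one character (4 characters total).
        System.out.println(decode(sent, "UTF-8").length());
        // Decoded as ISO-8859-1, each byte becomes its own character (5 total).
        System.out.println(decode(sent, "ISO-8859-1").length());
    }
}
```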
0
votes

Workaround (safe)

None of these answers worked for me, so I wrote this class to encode UTF strings as ASCII strings: every character that is not in the ASCII table is replaced with its numeric code, preceded and followed by a mark. Encode with AsciiEncoder.encode(yourString).

The string can then be decoded back with AsciiEncoder.decode(yourAsciiEncodedString).

package <your_package>;

/**
 * Created by Micha F. aka Peracutor.
 * 04.06.2017
 */

public class AsciiEncoder {

    public static final char MARK = '%'; // any ASCII char works; pick one that rarely occurs in regular text

    public static String encode(String s) {
        // extra room for 10 special characters (about 4 additional chars per replaced char)
        StringBuilder result = new StringBuilder(s.length() + 4 * 10);
        for (char c : s.toCharArray()) {
            if (c > 127 || c == MARK) {
                // non-ASCII chars and the mark itself become MARK + decimal code + MARK
                result.append(MARK).append((int) c).append(MARK);
            } else {
                result.append(c);
            }
        }
        return result.toString();
    }

    public static String decode(String s) {
        // Single left-to-right pass, so a decoded mark can never be mistaken
        // for the start of another escape sequence.
        StringBuilder result = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            int end = (c == MARK) ? s.indexOf(MARK, i + 1) : -1;
            if (end > i + 1) {
                try {
                    result.append((char) Integer.parseInt(s.substring(i + 1, end)));
                    i = end + 1;
                    continue;
                } catch (NumberFormatException ignored) {
                    // not an escape sequence; fall through and copy the mark as-is
                }
            }
            result.append(c);
            i++;
        }
        return result.toString();
    }
}

Hope this helps someone.