5
votes

I'm building a GWT app that stores the text of random web pages in a datastore Text field. The text is often UTF-8 encoded. All of my app's files are saved as UTF-8, and when I run the application on my local machine the entire process works fine: UTF-8 text is stored as such and can be retrieved from the local version of App Engine as UTF-8. However, when I deploy the app to Google App Engine, somewhere between storing the text and retrieving it, it stops being UTF-8, and non-ASCII characters are displayed as ?.

When I view the datastore in the App Engine control panel, all the special characters appear as ?, which leads me to believe that the problem occurs when writing to the database.

Does anyone know how to fix this?

The app itself is a little big. Here's some pseudocode:

Text webPageText = new Text(<STRING THAT CONTAINS UNICODE CHARACTERS>);

/*Some Code to store Text object on datastore
Specifically I'm using javax.jdo.PersistenceManager to do this.
Some Code to retrieve text from datastore. */

String retrievedText = webPageText.getValue();

The problem is that retrievedText comes back with ? instead of Unicode characters.

Here's a similar problem in Python that I found: "Trying to store Utf-8 data in datastore getting UnicodeEncodeError". My app, however, is not getting any errors.

Unfortunately I think Java strings are default utf-8 and I can't find any code that will let me declare them explicitly as utf-8.

Edit: I've now built a small webapp that takes in unicode text and stores it in the datastore and then retrieves it with no problems. I still have no idea where the problem is in my original source code but I'm going to change the way my code handles webpage retrieval to match the smaller app that I just built. Thank you everyone for your help.

4
Could you post the relevant bits of code? – David Underhill
You say you think the problem is with storage and retrieval, then don't include the code you're using to store and retrieve the data! We need the relevant code if we're to help at all. – Nick Johnson
The source for the entire project is now posted above. In a couple of hours' time I will try to make a small version that reproduces the problem. – Richard Wallis
@RichardWallis Did you find a solution for this, please? Two years after you encountered this, someone is still having this issue. – Babajide Prince

4 Answers

3
votes

I fixed the same issue by setting both the request and response encoding to UTF-8. Setting the request encoding results in a valid string being stored in the datastore; without it, values are stored as "????...".
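To illustrate where the "????" values can come from, here is a minimal sketch in plain Java. It assumes that some step in the pipeline encodes the text with a charset that cannot represent the characters (US-ASCII is used here purely as an example; the thread does not confirm which charset App Engine actually applied). The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original app.
final class QuestionMarkDemo {

    // Encodes and decodes with US-ASCII, standing in for a lossy hop in the pipeline.
    // String.getBytes replaces characters the charset cannot map with '?'.
    static String roundTripThroughAscii(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.US_ASCII);
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    // The same round trip with UTF-8, which can represent every Unicode character.
    static String roundTripThroughUtf8(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "café", written with an escape to avoid source-encoding issues
        // Booleans are printed instead of the text so the console encoding cannot distort the result.
        System.out.println(roundTripThroughAscii(original).equals("caf?"));
        System.out.println(roundTripThroughUtf8(original).equals(original));
    }
}
```

Once a character has been replaced by ? in this way, the original cannot be recovered, which is why the encoding has to be right before the value reaches the datastore.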

Requests: if you use Apache HTTP Client, this is done in the following way:

Get request:

NameValuePair[] params; // query parameters, populated in the elided code
...
String url = urlBase + URLEncodedUtils.format(Arrays.asList(params), "UTF-8");
HttpGet httpGet = new HttpGet(url);

Post request:

NameValuePair[] params; // form parameters, populated in the elided code
...
HttpPost httpPost = new HttpPost(url);
httpPost.setEntity(new UrlEncodedFormEntity(Arrays.asList(params), "UTF-8"));

Response: if you build your response in an HttpServlet, this is done in the following way:

HttpServletResponse resp;
...
resp.setContentType("text/html; charset=utf-8");

1
votes

I tried converting the String to a byte array and then storing it as a datastore Blob.

//Save String as Blob
Blob webPageText = new Blob(<STRING THAT CONTAINS UNICODE CHARACTERS>.getBytes());

//Retrieve Blob as String
String retrievedText = new String(webPageText.getBytes());

I originally thought this had solved the problem, but I had mistakenly tested it only on my local server. This code still returns ? instead of Unicode characters, which leads me to believe that the problem isn't in the datastore but in the transfer from App Engine to the client.
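One thing worth noting about the snippet above: getBytes() and new String(byte[]) without a charset argument use the platform default charset, which can differ between a local machine and a deployed server. The following is a minimal sketch, with hypothetical class and method names, of naming the charset explicitly at both ends. It is not confirmed to be the cause of the problem in this thread; it only removes one environment-dependent variable. With the datastore Blob, the byte array would be what goes into new Blob(...) and comes back from getBytes().

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original app.
final class ExplicitCharsetDemo {

    // Encode with an explicit charset instead of relying on the platform default.
    static byte[] toUtf8Bytes(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Decode with the same explicit charset that was used for encoding.
    static String fromUtf8Bytes(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // This value may differ between environments, which is the point of the sketch.
        System.out.println("Platform default charset: " + Charset.defaultCharset());

        String original = "\u00fcber \u2603"; // u-umlaut, "ber", a space, and a snowman
        byte[] stored = toUtf8Bytes(original);
        // Booleans are printed so the console encoding cannot distort the result.
        System.out.println(fromUtf8Bytes(stored).equals(original));
    }
}
```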

1
votes

Encoding solution: because the browser uses the "8859_1" charset, I convert the charset before saving to the datastore.

new String(req.getParameter("title").getBytes("8859_1"),"utf-8")

When I ran this application on my local machine, it was fine. But when I deployed, I faced the same issue you saw. I solved this problem by:

After (datastore save code):

new String(req.getParameter("title").getBytes("utf-8"),"utf-8")
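To make the two conversions above easier to compare, here is a minimal sketch of what each one does to a string, using plain Java and hypothetical names. It simulates a parameter whose UTF-8 bytes were decoded as ISO-8859-1 (the charset the answer calls "8859_1"). Which of the two forms a given deployment needs depends on how the container decoded the request, and the answer does not make that clear.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original app.
final class Latin1RepairDemo {

    // Simulates a container that decodes UTF-8 request bytes as ISO-8859-1.
    static String misdecodeAsLatin1(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    // The "8859_1" to "utf-8" conversion: recovers the original bytes, then decodes them as UTF-8.
    // This works because ISO-8859-1 maps every byte value to exactly one character and back.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    // The "utf-8" to "utf-8" conversion: leaves well-formed text unchanged.
    static String reencodeUtf8(String text) {
        return new String(text.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u00e9";                    // e with acute accent
        String garbled = misdecodeAsLatin1(original);  // two characters: U+00C3 U+00A9
        // Booleans are printed so the console encoding cannot distort the result.
        System.out.println(garbled.equals("\u00c3\u00a9"));
        System.out.println(repair(garbled).equals(original));
        System.out.println(reencodeUtf8(original).equals(original));
    }
}
```

The repair only helps when the parameter really was decoded as ISO-8859-1. Applied to a string that was already decoded correctly, it damages the text instead, which may explain why the conversion that worked locally differed from the one that worked after deployment.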