1
votes

I have a website and need to store data from a text field in a MySQL database. The frontend is Perl. I used utf8::encode to encode the data as UTF-8.

The request is made to the Java backend, which connects to the MySQL database and inserts this text. The table's default charset is set to utf8. This works in many cases, but it fails in some.

If I use テスト, the data stored in the database shows question marks: ã??ã?¹ã??. If I try to insert the UTF-8 encoded string directly from the SQL browser, everything works fine: Update events set summary = ãã¹ã where event_id = 11657;

While inserting, I noticed some blank characters that show up in the MySQL query browser, something like: ã ã¹ ã. After inserting from there, the data in the database shows boxes instead of these spaces, and テスト displays correctly on the website after UTF-8 decoding it. The problem occurs only when I insert from the website: these special characters come up as question marks in the database.

Can someone please help me with these special characters? Do I need to handle them differently?

4
Sounds like somewhere along the data flow there's an encoding problem. It's hard to say where without knowing anything about your actual code, but a common culprit in Java is using String.getBytes() (without an encoding argument). Another possibility is encoding issues in communicating the web page. - Ted Hopp
-1, downvote for not providing code. No one can actually reproduce the problem, and all the answers (4 so far) are full of assumptions. English prose is a poor substitute for code. - daxim

4 Answers

0
votes

We faced a similar issue in one of our projects, so we had to write a small routine that converts those UTF-8 characters into HTML entities before storing them in the database.
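A routine like that might look roughly like the sketch below. This is not the code from that project; the class and method names (EntityEncoder, toNumericEntities) are made up for illustration, and it only replaces non-ASCII code points with numeric character references. It does not escape <, > or &, and it works around the encoding problem rather than fixing it.

```java
public class EntityEncoder {

    /**
     * Hypothetical helper: replaces every non-ASCII code point with a
     * decimal numeric character reference such as &#12486;.
     * ASCII characters are copied through unchanged.
     */
    static String toNumericEntities(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (cp < 128) {
                out.appendCodePoint(cp);
            } else {
                out.append("&#").append(cp).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // "\u30C6\u30B9\u30C8" is the test string from the question, written
        // with escapes so the source file encoding does not matter.
        System.out.println(toNumericEntities("\u30C6\u30B9\u30C8"));
    }
}
```

The stored value is then pure ASCII, so it survives any of the charsets involved, at the cost of making the column harder to search and sort.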

0
votes

Use StringEscapeUtils from Apache Commons Lang:

import static org.apache.commons.lang.StringEscapeUtils.escapeHtml;
// ...
String source = "The less than sign (<) and ampersand (&) must be escaped before using them in HTML";
String escaped = escapeHtml(source);
0
votes

If the database really stored テスト, that's what you should see in the sql browser instead of mojibake.

It sounds like the Java backend is interpreting what Perl sends as ISO-8859-1 rather than UTF-8. This explains how テ gets converted into \u00E3\u0083\u0086. Then the backend tries to send the data to the database in Windows-1252, the MySQL default encoding (which MySQL calls latin1). Unfortunately, Windows-1252 cannot represent the Unicode characters in the range \u0080-\u009F, so the last two characters (\u0083 and \u0086) are replaced by question marks.
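To check this theory without the real backend, the suspected chain can be replayed in plain Java. This is only a sketch of the hypothesis, not the asker's code: the class and method names are invented, and it assumes the JVM ships the windows-1252 charset (most do). For the test string it should produce ã??ã?¹ã??, the same pattern reported in the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    /**
     * Replays the suspected mangling:
     * 1. the frontend sends UTF-8 bytes,
     * 2. the backend decodes them as ISO-8859-1,
     * 3. the result is encoded as windows-1252 on the way to the database.
     * Characters that windows-1252 cannot represent become '?'.
     */
    static String mangle(String original) {
        Charset cp1252 = Charset.forName("windows-1252");
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        byte[] stored = misread.getBytes(cp1252); // unmappable chars -> '?'
        return new String(stored, cp1252);
    }

    public static void main(String[] args) {
        // The test string from the question, written with escapes.
        System.out.println(mangle("\u30C6\u30B9\u30C8"));
    }
}
```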

So you have two problems:

  1. You should make the Java backend read the request in UTF-8 rather than in ISO-8859-1.
  2. The backend should use UTF-8 when talking with the database. The easiest way to do this is adding characterEncoding=utf8 to the connection parameters.
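A minimal sketch of both fixes, under some assumptions: the host and schema names passed in are placeholders, the helper names are hypothetical, and the request body is assumed to be available as raw bytes. In a servlet, calling request.setCharacterEncoding("UTF-8") before reading any parameters is the usual way to get the same effect as the manual decoding shown here.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Fix {

    /** Fix 1: decode the bytes received from the frontend explicitly as UTF-8. */
    static String decodeRequestBody(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    /**
     * Fix 2: build a MySQL JDBC URL that asks the driver to use UTF-8.
     * host and schema are placeholders supplied by the caller.
     */
    static String jdbcUrl(String host, String schema) {
        return "jdbc:mysql://" + host + "/" + schema
                + "?useUnicode=true&characterEncoding=utf8";
    }

    public static void main(String[] args) {
        // UTF-8 bytes of the first character of the test string (U+30C6).
        byte[] raw = {(byte) 0xE3, (byte) 0x83, (byte) 0x86};
        System.out.println(decodeRequestBody(raw).equals("\u30C6"));
        System.out.println(jdbcUrl("localhost", "mydb"));
    }
}
```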
0
votes

I'm assuming that you are sending POST parameters.

I think that the most likely cause of your initial problem is one of the following:

  1. If the parameters are being sent in the HTTP request body, your Perl front-end is probably not setting the encoding in the content type header of the request. The web server is probably assuming ISO-8859-1. The solution to this is to set the request content type properly.

  2. If the parameters are sent in the HTTP request URL, your web server is using the wrong character set when decoding the request parameters. The solution to this is going to be web-server specific ...
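To see what the second case does to the data, here is a small sketch that decodes the same percent-encoded parameter value with two different charsets. The class and method names are made up; the encoded value is simply the UTF-8 bytes of テスト. It does not show how to configure any particular web server.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class QueryDecodeDemo {

    /** Decodes a percent-encoded parameter value using the given charset name. */
    static String decode(String encoded, String charsetName) {
        try {
            return URLDecoder.decode(encoded, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String encoded = "%E3%83%86%E3%82%B9%E3%83%88";
        // Correct charset: the original three characters come back.
        System.out.println(decode(encoded, "UTF-8").equals("\u30C6\u30B9\u30C8"));
        // Wrong charset: nine Latin-1 characters, three per original character.
        System.out.println(decode(encoded, "ISO-8859-1").length());
    }
}
```

Decoding with ISO-8859-1 gives one character per byte, which is exactly the \u00E3\u0083\u0086 style of mangling that later turns into question marks.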

It sounds like there might also be a character set problem in talking to the database, but that might just be a consequence of earlier mangling.