0
votes

I have a database with NLS_CHARACTERSET = AL32UTF8.

Using JDBC with the Oracle thin driver, and a prepared statement with a bound parameter, I write the UTF-16 java String "\uD83D \uDE33" to a VARCHAR2 field.

Using SQLPLUS and selecting the field with utl_raw.cast_to_raw, I can see the following raw values: EFBFBD20EDB8B3, ie: "\uFFFD \uDE33"

When I read the field back into a UTF-16 java String using JDBC, the unicode values get converted to the replacement char, ie: "\uFFFD \uFFFD".

How can I get the original Unicode string back out?

Because what I read back is "\FFFD \FFFD", if I use this value in a where condition, it will never match what is in the database.

1
What is your Java locale/character set?Alex Poole
java.util.Locale.getDefault() is "en_AU" java.nio.charset.Charset.defaultCharset() is "UTF-8"tekumara
It looks like each of the two surrogate code points were directly stored in UTF-8 in the database. That is invalid under current definitions of UTF-8. Somewhere between your application and the database column, the surrogate pair in UTF-16 should have been converted to its Unicode code point, and that code point converted to UTF-8. The entire pair should be three to four bytes in UTF-8, depending on the code point. Invalid UTF-8 is sometimes replaced with the replacement character.Shannon Severance
There is a space between the two UTF-16 surrogate code units which I think is causing the encoding problem. The first surrogate code unit is stored in the database as the UTF-8 replacement char, but not the second.tekumara
I didn't read closely and missed the space and the encoding of one surrogate to the replacement char. In this case, they both should have both been saved as replacement chars, as you state in your answer.Shannon Severance

1 Answers

0
votes

The problem appears to be how the Oracle JDBC thin driver handles single surrogate code units. In the database, it ends up storing a single surrogate code unit, rather than replacing it with the replacement character. But when the field is read, is replaces single surrogate code units with the replacement character. So what is written, versus stored, versus retrieved, are all different.

What I would expect, is that what is stored in the database is what is read back. So when storing an invalid UTF-16 string it should store single surrogates as the replacement char, rather than the singular surrogate code units. That way, I can use any value read back in a where condition of a subsequent query.