1
votes

On our application server, the DevOps team are using SQL_ASCII encoding in the Postgres (9.4) DB.

A 3rd-party application is inserting Surnames into the Employee table with accented characters, e.g. Núñez

My Java (8) Application is a Spring (4.3.15) WebApp using Mybatis (3.2.4)

When my application reads such Surnames out of the SQL_ASCII db, I get:

org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0xe3 0xa1 0x54 at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2182) at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1911) at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:173) at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:616) at org.postgresql.jdbc2.AbstractJdbc2Statement.executeWithFlags(AbstractJdbc2Statement.java:466) at org.postgresql.jdbc2.AbstractJdbc2Statement.execute(AbstractJdbc2Statement.java:459) at org.apache.tomcat.dbcp.dbcp2.DelegatingPreparedStatement.execute(DelegatingPreparedStatement.java:93) at org.apache.tomcat.dbcp.dbcp2.DelegatingPreparedStatement.execute(DelegatingPreparedStatement.java:93) at jdk.internal.reflect.GeneratedMethodAccessor78.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.base/java.lang.reflect.Method.invoke(Method.java:566) at org.apache.ibatis.logging.jdbc.PreparedStatementLogger.invoke(PreparedStatementLogger.java:55) at com.sun.proxy.$Proxy98.execute(Unknown Source)

If I try changing the client_encoding via:

SET client_encoding = 'SQL_ASCII';

Then I get error:

org.postgresql.util.PSQLException: The server's client_encoding parameter was changed to LATIN1. The JDBC driver requires client_encoding to be UTF8 for correct operation. at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1950)

How can I "safely" read these characters from the DB?

2

2 Answers

1
votes

You are lost. An SQL_ASCII database is not encoding aware, it treats all bytes (except the 0 byte) equal. There will be no encoding conversion in the database.

So unless the data is accidentally encoded as UTF-8, which it isn't (according to the errror message), you cannot use it with the JDBC driver.

You will have to dump the database and restore it (using the appropriate -E option) into a different database (v13) with a proper encoding. Any encoding inconsistencies will have to be fixed manually during this process.

This question will provide additional insight.

1
votes

allowEncodingChanges=true

Can you try setting allowEncodingChanges=true in the JDBC connection URL? (and also characterEncoding)

allowEncodingChanges = boolean

When using the V3 protocol the driver monitors changes in certain server configuration parameters that should not be touched by end users. The client_encoding setting is set by the driver and should not be altered. If the driver detects a change it will abort the connection. There is one legitimate exception to this behaviour though, using the COPY command on a file residing on the server's filesystem. The only means of specifying the encoding of this file is by altering the client_encoding setting. The JDBC team considers this a failing of the COPY command and hopes to provide an alternate means of specifying the encoding in the future, but for now there is this URL parameter. Enable this only if you need to override the client encoding when doing a copy.

Ref: Chapter 3. Initializing the Driver

Bytes in the error message are 1110 0011, 1010 0001, 0101 0100

If the stored data was encoded in ISO-8859-1, this will be ã, ¡, T.

When this byte stream was read as UTF-8, 1110MSB in first byte indicates a 3 UTF-8 byte character(counting itself).

So the following next 2 bytes should start with 10MSB. But the 3rd byte is starting with 01MSB

By default, JDBC driver is decoding this stream in UTF-8 and failing on invalid a byte stream.

Assuming the third byte had started with 10MSB, the code would work without error, but it can incorrectly map all these 3 bytes into a single Unicode code point(assuming the original encoding was not UTF-8 and the corresponding unicode code point has a valid character representation).