Is UTF to EBCDIC Conversion lossless?

Question

We have a process which communicates with an external system via MQ. The external system runs on a mainframe machine (IBM z/OS), while we run our process on a CentOS Linux platform. So far we have never had any issues.

Recently we started receiving messages from them with non-printable EBCDIC characters embedded in the message. They use the characters as a compressed ID, 8 bytes long. When we receive it, it arrives on our queue encoded in UTF (CCSID 1208).

They need the original 8 bytes back in order to identify our response messages. I'm trying to find a solution in Java to convert the ID back from UTF to EBCDIC before sending the response.

I've been playing around with the JTOpen library, using the AS400Text class to do the conversion. Also, the counterparty has sent us a snapshot of the ID in bytes. However, when I compare the bytes after conversion, they are different from the original message.

Has anyone ever encountered this issue? Maybe I'm using the wrong code page?

Thanks for any input you may have.

Bytes from counterparty (positions [5,14]):

00000   F0 40 D9 F0 F3 F0 CB 56--EF 80 04 C9 10 2E C4 D4  |0 R030.....I..DM|

Program output:

UTF String: [R030Ã´Ã®Ã•Ã˜ÂœIDMDHP1027W 0510]
EBCDIC String: [R030Ã´Ã®ÃÃÂIDMDHP1027W 0510]
NATIVE CHARSET - HEX:     [52303330C3B4C3AEC395C398C29C491006444D44485031303237572030353130] 
CP500 CHARSET  - HEX:     [D9F0F3F066BE66AF663F663F623FC9102EC4D4C4C8D7F1F0F2F7E640F0F5F1F0]

Here is some sample code:

private void readAndPrint(MQMessage mqMessage) throws IOException {
    mqMessage.seek(150);
    byte[] subStringBytes = new byte[32];
    mqMessage.readFully(subStringBytes);

    String msgId = toHexString(mqMessage.messageId).toUpperCase();

    System.out.println("----------------------------------------------------------------");
    System.out.println("MESSAGE_ID: " + msgId);

    String hexString = toHexString(subStringBytes).toUpperCase();
    String subStr = new String(subStringBytes);
    System.out.println("NATIVE CHARSET - HEX:     [" + hexString + "] [" + subStr + "]");

    // Transform to EBCDIC
    int codePageNumber = 37;
    String codePage = "CP037";

    AS400Text converter = new AS400Text(subStr.length(), codePageNumber);
    byte[] bytesData = converter.toBytes(subStr);
    String resultedEbcdicText = new String(bytesData, codePage);

    String hexStringEbcdic = toHexString(bytesData).toUpperCase();
    System.out.println("CP500 CHARSET  - HEX:     [" + hexStringEbcdic + "] [" + resultedEbcdicText + "]");

    System.out.println("----------------------------------------------------------------");
}

new String(subStringBytes); - this is using your default encoding. Do you know what it is, and do you know that it supports all possible byte combinations that you might get, and do you know if it's reversible? — parsifal
Also, "UTF" is meaningless without a suffix. Are you talking "UTF-8"? If that's the case, then the answer is clearly no, because not all byte sequences are legal in UTF-8 -- including what appear to be the first three bytes of your message. — parsifal
CCSID 1208 in MQ corresponds to UTF-8 (www-01.ibm.com/software/globalization/ccsid/…). When you say not all byte sequences are legal, do you mean because UTF-8 is variable width? — Jose
Not just that it's variable width, but that the high-order bits have meaning (thus my link to the Wikipedia page). The "MESSAGE_ID" that you show starts with C3E2, which is an invalid UTF-8 sequence: C3 is the start of a two-byte sequence, but E2 is not a valid second byte; it's only valid as the first byte of a 3-byte sequence. — parsifal
I want to point out again that new String(subStringBytes) uses your platform default encoding. Maybe that's UTF-8 for you, maybe it isn't. Worse, it might be UTF-8 for you and not UTF-8 on whatever platform you use for deployment. — parsifal
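
To make the two points in these comments concrete, here is a minimal sketch using only standard charsets. The class and method names are made up for illustration, and the eight sample bytes are copied from the counterparty's dump on the assumption that they are the binary ID. It checks whether raw bytes survive a decode/encode round trip, and shows that the same received bytes decode to strings of different lengths depending on which charset `new String(...)` ends up using.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class CharsetRoundTrip {

    // Decode the bytes with the given charset, encode them again,
    // and report whether the result is byte-for-byte identical.
    static boolean survivesRoundTrip(byte[] raw, Charset cs) {
        byte[] back = new String(raw, cs).getBytes(cs);
        return Arrays.equals(raw, back);
    }

    // Number of chars produced when the bytes are decoded with the given charset.
    static int decodedLength(byte[] raw, Charset cs) {
        return new String(raw, cs).length();
    }

    public static void main(String[] args) {
        // Sample bytes from the counterparty's dump (assumed to be the binary ID).
        byte[] id = {(byte) 0xCB, 0x56, (byte) 0xEF, (byte) 0x80,
                     0x04, (byte) 0xC9, 0x10, 0x2E};

        // Not valid UTF-8, so the decoder substitutes replacement characters.
        System.out.println("UTF-8 intact:      "
                + survivesRoundTrip(id, StandardCharsets.UTF_8));      // false
        // ISO-8859-1 maps every byte value to exactly one char and back.
        System.out.println("ISO-8859-1 intact: "
                + survivesRoundTrip(id, StandardCharsets.ISO_8859_1)); // true

        // First four bytes after "R030" in the NATIVE CHARSET hex dump.
        byte[] received = {(byte) 0xC3, (byte) 0xB4, (byte) 0xC3, (byte) 0xAE};
        System.out.println(decodedLength(received, StandardCharsets.UTF_8));      // 2
        System.out.println(decodedLength(received, StandardCharsets.ISO_8859_1)); // 4
    }
}
```

Passing the charset explicitly, as here, avoids depending on whatever the platform default happens to be on the deployment machine.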

user2338816 · Accepted Answer · 2014-03-20T06:06:44

If an MQ message has varying sub-message fields that require different encodings, then that's how you should handle those messages, i.e., as separate message pieces.

But as you describe this, the entire message needs to be received without conversion. The first eight bytes need to be extracted and held separately. The remainder of the message can then have its encoding converted (unless other sub-fields also need to be extracted as binary, unconverted bytes).

For any return message, the opposite conversion must be done. The text portion of the message can be converted, and then that sub-string can have the original eight bytes prepended to it. The newly reconstructed message then can be sent back through the queue, again without automatic conversion.
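
A rough sketch of that split-and-reassemble step, operating on a plain `byte[]`. It assumes the message body was already obtained from the queue without automatic conversion, that the ID occupies the first eight bytes as described above, and that the text part is in code page 37 (`IBM037`); the real layout and code page need to be confirmed with the counterparty. `BinaryIdMessage`, `parse`, and `buildReply` are hypothetical names, not part of any MQ API.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

final class BinaryIdMessage {
    static final int ID_LENGTH = 8; // per the question: the ID is 8 bytes long

    final byte[] id;   // opaque bytes, never passed through a charset
    final String text; // everything after the ID, decoded as text

    private BinaryIdMessage(byte[] id, String text) {
        this.id = id;
        this.text = text;
    }

    // Split an unconverted message body into the binary ID and the decoded text.
    static BinaryIdMessage parse(byte[] raw, Charset textCharset) {
        if (raw.length < ID_LENGTH) {
            throw new IllegalArgumentException("message shorter than the ID field");
        }
        byte[] id = Arrays.copyOfRange(raw, 0, ID_LENGTH);
        String text = new String(raw, ID_LENGTH, raw.length - ID_LENGTH, textCharset);
        return new BinaryIdMessage(id, text);
    }

    // Build a reply body: the original ID bytes first, then the encoded reply text.
    static byte[] buildReply(byte[] id, String replyText, Charset textCharset) {
        byte[] textBytes = replyText.getBytes(textCharset);
        byte[] out = Arrays.copyOf(id, id.length + textBytes.length);
        System.arraycopy(textBytes, 0, out, id.length, textBytes.length);
        return out;
    }

    public static void main(String[] args) {
        // Assumption: the mainframe side uses code page 37.
        Charset ebcdic = Charset.forName("IBM037");
        // Sample ID bytes followed by "R030" in EBCDIC (D9 F0 F3 F0).
        byte[] raw = {(byte) 0xCB, 0x56, (byte) 0xEF, (byte) 0x80,
                      0x04, (byte) 0xC9, 0x10, 0x2E,
                      (byte) 0xD9, (byte) 0xF0, (byte) 0xF3, (byte) 0xF0};
        BinaryIdMessage msg = parse(raw, ebcdic);
        byte[] reply = buildReply(msg.id, msg.text, ebcdic);
        System.out.println(msg.text + " / bytes identical: " + Arrays.equals(raw, reply));
    }
}
```

The point of the design is that the ID is only ever copied as bytes; only the text portion goes through a `Charset`.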

Your partner on the other end is not using the messaging product correctly. (Of course, you probably shouldn't say that out loud.) No part of such a message should be unable to survive automatic conversion intact in both directions. Instead of an 8-byte binary field, the ID could be sent as, for example, a 16-character hex representation of the 8-byte value. In hex, there'd be no conversion problem in either direction.
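
For illustration, a small sketch of what the hex alternative could look like. `HexId` and its methods are hypothetical names, and the sample bytes are again taken from the dump in the question. Because the result contains only the characters 0-9 and A-F, it is ordinary text that can go through code page conversion like the rest of the message.

```java
import java.util.Arrays;

final class HexId {

    // Render each byte as two uppercase hex digits.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Parse a hex string back into the original bytes.
    static byte[] fromHex(String hex) {
        if (hex.length() % 2 != 0) {
            throw new IllegalArgumentException("hex string must have even length");
        }
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] id = {(byte) 0xCB, 0x56, (byte) 0xEF, (byte) 0x80,
                     0x04, (byte) 0xC9, 0x10, 0x2E};
        String hex = toHex(id);
        System.out.println(hex);                             // CB56EF8004C9102E
        System.out.println(Arrays.equals(id, fromHex(hex))); // true
    }
}
```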
