2
votes

I am trying to debug a weird issue, hoping a Unicode expert here would be able to help.

  • I have a (Perl based) sender program, which takes some data structure
  • it encodes the data structure into a proprietary serialized format which uses curly braces for encoding the data. Here's an example serialized string: {{9}{{8}{{skip_association}{{0}{}}}{{data}{{9}{{1}{{exceptions}{{9}{{1}{{-472926}{{9}{{1}{{AAAAAAYQ2}
  • it then sends that serialized string to a Java server
  • Java server tries to de-serialize the string back into a data structure.
  • The encoding does not really matter too much (imho) other than it uses field length as part of encoded data; e.g. {{id}{{7}9{Z928D2AA2}}} means a field named "id", of type "string" (7), length of string 9, value Z928D2AA2.
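
Since the format embeds field lengths, one thing worth checking is what unit the length is counted in. The description above does not say whether it is characters or encoded bytes, so the following is only a sketch under that uncertainty, written in Java because that is the receiving side. The class `LengthUnits` and its methods are hypothetical names, not part of the serializer. It shows that a value containing U+0082 has different lengths depending on the unit and the byte encoding, whereas a pure ASCII value does not.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of the real serializer.
class LengthUnits {
    // Length in Unicode code points ("characters").
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Length in bytes once the string is encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Length in bytes once the string is encoded as ISO-8859-1 (Latin-1).
    static int latin1Bytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    public static void main(String[] args) {
        String ascii = "Z928D2AA2";     // the ASCII value from the example
        String withControl = "id\u0082"; // made-up value containing U+0082
        for (String s : new String[] {ascii, withControl}) {
            System.out.printf("codePoints=%d utf8Bytes=%d latin1Bytes=%d%n",
                    codePoints(s), utf8Bytes(s), latin1Bytes(s));
        }
        // prints:
        // codePoints=9 utf8Bytes=9 latin1Bytes=9
        // codePoints=3 utf8Bytes=4 latin1Bytes=3
    }
}
```

If sender and receiver disagree on the unit or on the byte encoding, the declared length would be off by one for such a value. This alone does not explain why other non-ASCII characters work, so treat it as one thing to rule out rather than a diagnosis.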

Problem: When the data structure being serialized contains some specific Unicode character(s), the de-serialization fails.

Specifically, this character: U+0082 (which various online decoders display as %82 or 0x82) causes the issue.

I'm trying to understand why this would be an issue and what's so special about this character - there are other Unicode characters that do not break the de-serializer.

Is there something special about the U+0082 (aka 0x82) Unicode character that would interfere with parsing a serialized string that depends on curly braces as separators and on field lengths being known?

Unfortunately, I am unable to debug the decoding library, so I only get a generic error message that decoding failed, with no indication of what exactly failed.

P.P.S. Double extra curious: when I used that character in the title of the SO question, it printed in the preview, but got deleted when the question was posted! When I copy/pasted the strings into the editor, their measured length matched the length recorded in the encoded string.

P.S. The Perl code doing the serialization as far as I know is fully Unicode compliant:

use open      qw(:std :utf8);    # undeclared streams in UTF-8
use charnames qw(:full :short);  # unneeded in v5.16
use Encode qw(decode);
1
It's really impossible to say without knowing anything about the serialization format or implementation. - Grinnz
@Grinnz - I'm hoping this Unicode character is something special (like, equivalent to a closing curly brace or something; or has weird length calculations) - DVK
The only thing special about this character vs other Unicode characters is that it can be represented in cp1252 (the native single byte encoding of most US systems). - Grinnz
Also, I'd recommend :encoding(UTF-8) instead of :utf8, as the latter is an internal use layer which can end up creating an invalid string if you feed it garbage. - Grinnz
I don't know what you mean by "backwards decoding". You can't print Unicode characters, they must always be encoded to something for serialization, and the bytes will be different if they are not ASCII characters. That's why you should verify what the decoded character is. - Grinnz
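
One way to verify which character actually arrived, given that U+0082 is invisible and tends to get lost in copy/paste, is to print the code points of the received string. A minimal sketch in Java follows. `CodePointDump` and `dump` are hypothetical names, and where to call it in the real server is not something this thread tells us.

```java
import java.util.stream.Collectors;

// Hypothetical debugging helper: renders every code point as U+XXXX.
class CodePointDump {
    static String dump(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // 'a', the control character U+0082, then a closing curly brace
        System.out.println(dump("a\u0082}")); // prints "U+0061 U+0082 U+007D"
    }
}
```

If the dump on the Java side shows something other than U+0082 (for example two code points where one was sent), the string was probably decoded with a different encoding than the one used to send it.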

1 Answer

3
votes

You can see information about characters in the Unicode Character Database; a text dump of it can be found at https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt where the entry for this character reads:

0082;<control>;Cc;0;BN;;;;;N;BREAK PERMITTED HERE;;;;

The meanings of the fields can be found at http://www.unicode.org/reports/tr44/#UnicodeData.txt (though that seems to omit the first field, which is the codepoint).

So it is an "other" class control character, with Bidirectional Category "Boundary Neutral" (which is normal for a Cc or Cf class character). There isn't anything else special about it.

But being a control character, it doesn't surprise me that something expecting text data has a problem with it.
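
The same properties can be checked from code. Here is a small sketch using `java.lang.Character`, in Java since that is what the receiving server runs. The class and method names are made up for illustration.

```java
// Hypothetical helper that queries the properties discussed above.
class ControlCharInfo {
    // True if the code point has general category Cc (Other, Control).
    static boolean isControlCategory(int codePoint) {
        return Character.getType(codePoint) == Character.CONTROL;
    }

    // True if the code point has bidirectional class BN (Boundary Neutral).
    static boolean isBoundaryNeutral(int codePoint) {
        return Character.getDirectionality(codePoint)
                == Character.DIRECTIONALITY_BOUNDARY_NEUTRAL;
    }

    public static void main(String[] args) {
        int cp = 0x82;
        System.out.println("Cc: " + isControlCategory(cp));             // Cc: true
        System.out.println("BN: " + isBoundaryNeutral(cp));             // BN: true
        System.out.println("ISO control: " + Character.isISOControl(cp)); // ISO control: true
    }
}
```

A parser that rejects or strips ISO control characters would trip on U+0082 while accepting ordinary non-ASCII letters, which would match the symptom. Whether the decoding library does that is an assumption that cannot be confirmed from here.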