I am trying to debug a weird issue and am hoping a Unicode expert here can help.
- I have a (Perl based) sender program, which takes some data structure
- it encodes the data structure into a proprietary serialized format which uses curly braces for encoding the data. Here's an example serialized string:
{{9}{{8}{{skip_association}{{0}{}}}{{data}{{9}{{1}{{exceptions}{{9}{{1}{{-472926}{{9}{{1}{{AAAAAAYQ2}
- it then sends that serialized string to a Java server
- The Java server tries to de-serialize the string back into a data structure.
- The encoding does not really matter too much (imho), other than that it uses the field length as part of the encoded data; e.g.
{{id}{{7}9{Z928D2AA2}}}
means a field named "id", of type "string" (7), length of string 9, value Z928D2AA2.
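To make the length prefix concrete, here is a minimal sketch of how a single string field could be produced in this format. The helper name encode_string_field is hypothetical and not part of the real serializer; the type code 7 is taken from the example above, and counting the length in characters rather than bytes is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper (not the real serializer's API): encode one string
# field as {{name}{{7}LENGTH{value}}}, following the example above.
# ASSUMPTION: LENGTH is the number of characters, not the number of bytes.
sub encode_string_field {
    my ($name, $value) = @_;
    my $length = length $value;    # character count of a decoded Perl string
    # A single-quoted format string keeps the braces literal and avoids
    # accidental interpolation such as $length{...}.
    return sprintf '{{%s}{{7}%d{%s}}}', $name, $length, $value;
}

print encode_string_field('id', 'Z928D2AA2'), "\n";
```

For the id example this reproduces the string shown above, so a reader that trusts the length prefix will consume exactly 9 units after the opening brace of the value.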
Problem: When the data structure being serialized contains some specific Unicode character(s), the de-serialization fails.
Specifically, this character: "" (which various online decoders display as %82 or 0x82) causes the issue.
I'm trying to understand why this would be an issue and what's so special about this character - there are other Unicode characters that do not break the de-serializer.
Is there something special about this Unicode character (aka 0x82) that would interfere with parsing a serialized string that depends on curly braces as separators and on field lengths being known?
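One check that may help narrow this down is to compare the length of a value in characters with its length in UTF-8 bytes, since a length-prefixed format only round-trips if both sides count the same unit. The sketch below is a diagnostic only and assumes the character is code point U+0082; the helper name length_report is hypothetical, and nothing above confirms that a character/byte mismatch is the actual cause.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical diagnostic helper: report a value's length in characters
# and in UTF-8 bytes, plus the encoded bytes in hex.
sub length_report {
    my ($value) = @_;
    my $bytes = encode('UTF-8', $value);    # character string -> UTF-8 octets
    my %report = (
        chars      => length $value,
        utf8_bytes => length $bytes,
        hex        => join(' ', map { sprintf '%02X', ord } split //, $bytes),
    );
    return \%report;
}

# ASSUMPTION: the problematic character is code point U+0082.
my $report = length_report("A\x{82}B");
printf "chars=%d utf8_bytes=%d hex=%s\n",
    @{$report}{qw(chars utf8_bytes hex)};
```

If the two counts differ for a failing value but agree for every working one, that would suggest the sender and the receiver interpret the length field in different units.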
Unfortunately, I am unable to debug the decoding library, so I only get a generic error message that decoding failed, without any indication of what failed.
P.P.S. Doubly curious: when I used that character in the title of the SO question, it showed up in the preview but got deleted when the question was posted! When I copied and pasted the strings into the editor, their measured length matched the length given in the encoded string.
P.S. The Perl code doing the serialization is, as far as I know, fully Unicode compliant:
use open qw(:std :utf8); # undeclared streams in UTF-8
use charnames qw(:full :short); # unneeded in v5.16
use Encode qw(decode);
Use :encoding(UTF-8) instead of :utf8, as the latter is an internal-use layer which can end up creating an invalid string if you feed it garbage. – Grinnz