9
votes

I want to encode and decode binary data within an XML file (with Python, but whatever). I have to face the fact that an XML tag content has illegal characters. The only allowed ones are described in XML specs:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

Which means that the unallowed are:

  • 29 Unicode control characters are illegal (0x00 - 0x20) ie (000xxxxx) except 0x09, 0x0A, 0x0D
  • Any Unicode character representation above 2 bytes (UTF-16+) is illegal (U+D800 - U+DFFF) ie (11011xxx)
  • The special Unicode noncharacters are illegal (0xFFFE - 0xFFFF) ie (11111111 1111111x)
  • <, >, & according to this post for entities content

1 byte can encode 256 possibles. With these restrictions the first byte is limited to 256-29-8-1-3 = 215 possiblities.

Of that first bytes's 215 possibilites, base64 only uses 64 possibilites. Base64 generates 33% overhead (6 bits becomes 1 byte once encoded with base64).

So my question is simple: Is there an algorithm more efficient than base64 to encode binary data within XML? If not, where should we start to create it? (libraries, etc.)

NB: You wouldn't answer this post by "You shouldn't use XML to encode binary data because...". Just don't. You could at best argue why not to use the 215 possibilities for bad XML parser's support.

NB2: I'm not speaking about the second byte but there are certainly some considerations that wa can develop regarding the number of posibilities and the fact it should start by 10xxxxxx to respect UTF8 standard when we use the supplementary Unicode planes (what if not?).

3
Well, you could use an approach similar to yEnc whereby you only escape characters which have another meaning within the medium in which they're used, but I don't know of any existing implementations of such an approach for your particular case. See also this page.Aya

3 Answers

8
votes

Thank you Aya for the Asci85 link, there are very good ideas.

I developed them below for our case.


UTF-8 characters possibilites:


For 1 byte characters (0xxxxxxx): 96 possibilites per byte

  • + UTF-8 ASCII chars 0xxxxxxx = +2^7
  • - UTF-8 Control chars 000xxxxx = -2^5
  • + XML allowed UTF-8 control chars (00000009, 0000000A, 0000000D) = +3
  • - XML entity unallowed chars (<, >, &) = -3

EDIT: This is for XML1.0 specs. XML 1.1 specs allows the use of control chars except 0x00...

For 2-bytes characters (110xxxxx 10xxxxxx): 1920 possibilites per 2 bytes

  • + UTF-8 2-byte chars 110xxxxx 10xxxxxx = +2^11
  • - UTF-8 illegal non-canonical chars (1100000x 10xxxxxx) = -2^7

For 3-bytes characters (1110xxxx 10xxxxxx 10xxxxxx): 61440 possibilites per 3 bytes

  • + UTF-8 3-byte chars 1110xxxx 10xxxxxx 10xxxxxx = +2^16
  • - UTF-8 illegal non-canonical chars (11100000 100xxxxx 10xxxxxx) = -2^11
  • - Unicode reserved UTF-16 codepoints (11101101 101xxxxx 10xxxxxx) = -2^11

And I won't make the calculation for 4-bytes characters, that's pointless: the number of possibles would be negligible and there are too many illegal UTF-8 charactrs in this range.


The coding possibilites in let's say a 3-byte space


So let's see what combinations we can do on a 3-byte (24bit) space:

  • 0xxxxxxx 0xxxxxxx 0xxxxxxx : That's 96*96*96 = 884736 possibilities
  • 0xxxxxxx 110xxxxx 10xxxxxx : That's 96*1920 = 184320 possibilities
  • 110xxxxx 10xxxxxx 0xxxxxxx : That's 1920*96 = 184320 possibilities
  • 1110xxxx 10xxxxxx 10xxxxxx : That's 61440 = 61440 possibilities

There would be other possibilites (like a 3-bytes char ending or starting in the space but like 4-bytes chars, that would be difficult to evaluate (for me) and probably negligible).

Total number of possibilities:

  • A 3-byte space has 2^24 = 16777216 possibilities.
  • UTF-8 COMPATIBLE possibilites in that space is 884736+2*184320+61440 = 1314816 possibilities.

How much overhead does that mean?

  • 24-bit space usable bits : Log2(16777216)=24 (of course! that's for the math understanding)
  • This space's UTF-8 useful bits: Log2(1314816)=20,32 useful bits.
  • That means we need 24 bits of space to encode 20,32 bits of useful information, ie. the minimum theorical overhead is 18% overhead. Way better than Base64's 33% overhead and Ascii85's 25% overhead!

EDIT: This is for XML1.0 specs. With XML1.1 (not widely supported...), the theorical overhead is 12.55%. I managed to make a binary safe algorithm with 14.7% overhead for XML1.1.


How to get near this 18% overhead?


The bad news is that we can't get easily that 18% ovearhead without using a big "dictionnary" (ie long enconding sets). But it's easy to get 20%, and quite easy but less practical to get 19%.

Good coding lengths candidates:

  • 6 bits can encode 5 bits with 20% overhead (2^(6*0,84) > 2^5)
  • 12 bits can encode 10 bits with 20% overhead (2^(12*0,84) > 2^10)
  • 24 bits can encode 20 bits with 20% overhead (2^(24*0,84) > 2^20)
  • 25 bits can encode 21 bits with 19% overhead (2^(25*0,84) > 2^21)

NB: 0,84 is the average "usefulness" of a space bit (20,32/24).


How to build our encoding algorithm?


We need to build a "dictionnary" that will map the "space possibles" (randoms sequence of bits whose length is 5, 10, 20 or 21 bits depending on the chosen coding length for the algorithm - just choose one) into the utf8-compatible sequences (utf8 bit sequence whose length is 6, 12, 24 or 25 bits accordingly).

The easiest starting point would be the encoding of 20bits sequence into 24bits compatible UTF-8 sequences: that's exactly the example that was taken above to calculate the possibilites and that's 3 UTF-8 bytes long (so we won't have to worry about unterminated UTF8 chars).

Note that we MUST USE the 2-byte (or above) UTF-8 characters encoding space to reach the 20% overhead. With only 1-byte long UTF8 characters we can only reach 25% overhead with RADIX-24. However, the 3-byte long UTF-8 characters are needless to reach the 20% overhead.

That's the next challenge for this question. Who wants to to play? :)


A proposal of algorithm, I'll name BaseUTF-8 for XML


20 binary bits to encode: ABCDEFGHIJKLMNOPQRST

Resulting UTF-8 string named "encoded": 24 bits long

Mathematical encoding algorithm (not based on any known programming language):

If GH != 00 && NO != 00:
    encoded = 01ABCDEF 0GHIJKLM 0NOPQRST # 20 bits to encode, 21 space bits with restrictions (1-byte UTF-8 char not starting by 000xxxxx ie ASCII control chars)

If ABCD != 0000:
    If GH == 00 && NO == 00: # 16 bits to encode
        encoded = 0010ABCD 01EFIJKL 01MPQRST    
    Else If GH == 00:  # 18 bits to encode, 18 space bits with restrictions (1-byte  UTF-8 ASCII control char, 2-bytes UTF-8 char noncanonical)
        encoded = 0NOFIJKL 110ABCDE 10MPQRST
    Else If NO == 00:  # 18 bits to encode
        encoded = 110ABCDE 10MPQRST 0GHFIJKL

If ABCD == 0000: # 16 bits to encode
    encoded = 0011EFGH 01IJKLMN 01OPQRST

On "encoded" variable apply:
    convert < (0x3C) to Line Feed (0x0A)
    convert > (0x3E) to Cariage Return (0x0D)
    convert & (0x26) to TAB (0x09)

And that's how you get a 20% overhead only.

This algorithm doesn't provide yet a way to manage string termination, when the string to encode is not a multiple of 20. The decoding algorithm has also to be provided, but that's quite easy (just don't forget to throw exceptions to force the unicity of decoding).

2
votes

I have developed the concept in a C code.

The project is on GitHub and is finally called BaseXML: https://github.com/kriswebdev/BaseXML

It has a 20% overhead, which is good for a binary safe version.

I had a hard time making it work with Expat, which is the behind the scene XML parser of Python (THAT DOESN'T SUPPORT XML1.1!). So you'll find the BaseXML1.0 Binary safe version for XML1.0.

I will maybe release the "for XML1.1" version later if requested (it is also binary safe and have a 14.7% overhead), it's ready and working indeed but useless with Python built-in XML parsers so I don't want to confuse people with too many versions (yet).

1
votes

It's worse than that: you don't actually have 215 different byte values you can use. The resulting binary data have to be valid in whatever encoding the XML is represented in (which is almost certainly UTF-8), which means that many, many byte sequences are forbidden. 0xc2 followed by 0x41 would be just one random example. XML is text (a sequence of Unicode characters), not binary data. When transmitted, it is encoded using some encoding (which is almost alwats UTF-8). If you try to treat it as binary data, then you are, in my opinion, asking for way more trouble than it's worth dealing with.

If you still want to do this...

XML is text. So let's not try to encode your binary data as binary data. That will not lead to an easy or obvious way to showhorn it into an XML document. Let's try instead encoding your binary data as text!

Let's try one very simple encoding:

  • Group your binary data into blocks of 20 bits
  • Encode each group of 20 bits as the Unicode character U+10000 plus the numeric value of the 20 bits.

This will mean you exclusively use characters from planes 1 through 16. All of the restricted characters are in plane 0 (the BMP), so you are safe here.

When you then encode this XML document as UTF-8 for transmission, each of these characters will require 4 bytes to encode. So you consume 32 bits for every 20 bits of original data, which is 60% overhead compared to pure binary encoding of the original data. This is worse than base64's 33%, which makes it a terrible idea.

This encoding scheme is slightly wasteful because it makes no use of BMP characters. Can we use BMP characters to make it better? Not trivially. 20 is the largest size we can use for the groups (log(0x10FFFF) ~ 20.09). We could remap out scheme to use some as manu BMP characters as possible because these take less space to encode with UTF-8, but not only would this complicate the encoding a lot (the forbidden characters are scattered, so we have several cases to handle) but it can only lead to improvement for about 6.25% of bit patterns (fraction of Unicode characters that are in the BMP), and for the majority of that 6.25%, we'd save only one byte. For random data, the overhead decreases from 60% to around 55%. The result would still be much worse than base64 except for some very contrived data. Note that the overhead is data-dependant though. For 0.2% of bit patterns, you will actually get compression instead of overhead (60% compression for 0.012% of patterns and 20% compression for 0.18% of patterns). But these fractions are really low. It's just not worth it.

To put this another way: if you want to encode anything using 4-byte UTF-8 sequences, you need to use 32 bits per sequence (of course) but 11 of those bits are fixed and unchangeable: the bits must fit the pattern 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx and there are only 21 xs in there). That overhead of 60% is built in to UTF-8 so if you want to use this as the basis of any encoding that improves upon the overhead of base64, you are starting from behind!

I hope this convinces you that you can't improve on the density of base64 using any scheme of this type.