110
votes

I have two applications written in Java that communicate with each other using XML messages over the network. I'm using a SAX parser at the receiving end to get the data back out of the messages. One of the requirements is to embed binary data in an XML message, but SAX doesn't like this. Does anyone know how to do this?

UPDATE: I got this working with the Base64 class from the Apache Commons Codec library, in case anyone else is trying something similar.

12 Answers

223
votes

You could encode the binary data using Base64 and put it into a Base64 element; the article below is a pretty good one on the subject.

Handling Binary Data in XML Documents
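
For the Java/SAX setup described in the question, a round trip might look roughly like the sketch below. It assumes Java 8+ and uses the JDK's java.util.Base64 rather than the Apache Commons Codec class the asker ended up with; the <message> and <payload> element names are made up for illustration.

```java
import java.io.StringReader;
import java.util.Base64;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

final class Base64SaxDemo {

    // Wraps the raw bytes in a hypothetical <payload> element as Base64 text.
    static String toXml(byte[] raw) {
        String encoded = Base64.getEncoder().encodeToString(raw);
        return "<message><payload>" + encoded + "</payload></message>";
    }

    // Parses the message with SAX and decodes the text content of <payload>.
    static byte[] fromXml(String xml) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private boolean inPayload;

            @Override
            public void startElement(String uri, String localName, String qName, Attributes attrs) {
                inPayload = "payload".equals(qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver one text node in several chunks, so accumulate.
                if (inPayload) {
                    text.append(ch, start, length);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                inPayload = false;
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse message", e);
        }
        // The MIME decoder skips line breaks and indentation around the data.
        return Base64.getMimeDecoder().decode(text.toString());
    }

    public static void main(String[] args) {
        byte[] raw = {0, 1, 2, (byte) 0xFF, 0x3C, 0x26}; // includes the byte values of '<' and '&'
        byte[] back = fromXml(toXml(raw));
        System.out.println(java.util.Arrays.equals(raw, back));
    }
}
```

The Base64 alphabet (A-Z, a-z, 0-9, +, / and =) contains no characters that need escaping in XML, which is why the encoded text can sit directly inside an element.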

212
votes

XML is so versatile...

<DATA>
  <BINARY>
    <BIT index="0">0</BIT>
    <BIT index="1">0</BIT>
    <BIT index="2">1</BIT>
    ...
    <BIT index="n">1</BIT>
  </BINARY>
</DATA>

XML is like violence - If it doesn't solve your problem, you're not using enough of it.

EDIT:

BTW: Base64 + CDATA is probably the best solution

(EDIT2:
Whoever upmods me, please also upmod the real answer. We don't want any poor soul to come here and actually implement my method because it was the highest ranked on SO, right?)

27
votes

Base64 is indeed the right answer, but CDATA is not. CDATA basically says "this could be anything", whereas the content must not be just anything: it has to be Base64-encoded binary data. XML Schema defines base64Binary as a primitive datatype, which you can use in your XSD.
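
As a rough illustration of what the schema type buys you, the sketch below validates a document against a tiny schema using the JDK's javax.xml.validation API. The schema and the <payload> element name are assumptions made up for this example, not something from the question.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;

final class Base64SchemaDemo {

    // Hypothetical schema: a single <payload> element typed as xs:base64Binary.
    static final String XSD =
        "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
      + "<xs:element name=\"payload\" type=\"xs:base64Binary\"/>"
      + "</xs:schema>";

    // Returns true when the document validates against the schema above.
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory
                    .newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (org.xml.sax.SAXException e) {
            return false; // the content does not match the declared type
        } catch (java.io.IOException e) {
            throw new java.io.UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<payload>dGVzdGRhdGE=</payload>"));
        System.out.println(isValid("<payload>not base64!</payload>"));
    }
}
```

With the element typed this way, a validating receiver can reject malformed payloads before any decoding code runs.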

14
votes

I had this problem just last week. I had to serialize a PDF file and send it, inside an XML file, to a server.

If you're using .NET, you can convert a binary file directly to a base64 string and stick it inside an XML element.

string base64 = Convert.ToBase64String(File.ReadAllBytes(fileName));

Or, there is a method built right into the XmlWriter object. In my particular case, I had to include Microsoft's datatype namespace:

using System.IO;
using System.Text;
using System.Xml;

StringBuilder sb = new StringBuilder();
using (XmlWriter xw = XmlWriter.Create(sb))
{
    xw.WriteStartElement("doc");
    xw.WriteStartElement("serialized_binary");
    // Declares the "types" prefix and marks the element content as bin.base64.
    xw.WriteAttributeString("types", "dt", "urn:schemas-microsoft-com:datatypes", "bin.base64");
    byte[] b = File.ReadAllBytes(fileName);
    xw.WriteBase64(b, 0, b.Length);
    xw.WriteEndElement();
    xw.WriteEndElement();
} // disposing the writer flushes its buffer into sb
string abc = sb.ToString();

The string abc looks something like this:

<?xml version="1.0" encoding="utf-16"?>
<doc>
    <serialized_binary types:dt="bin.base64" xmlns:types="urn:schemas-microsoft-com:datatypes">
        JVBERi0xLjMKJaqrrK0KNCAwIG9iago8PCAvVHlwZSAvSW5mbw...(plus lots more)
    </serialized_binary>
</doc>

6
votes

I usually encode the binary data with MIME Base64 or URL encoding.

5
votes

Try Base64 encoding/decoding your binary data. Also look into CDATA sections.

4
votes

Maybe encode the data into a known character set; something like Base64 is a popular choice.

4
votes

Any binary-to-text encoding will do the trick. I use something like this:

<data encoding="yEnc">
<![CDATA[ encoded binary data ]]>
</data>

4
votes

Base64 overhead is 33%.

BaseXML for XML 1.0 has an overhead of only 20%. But it's not a standard and so far only has a C implementation. Check it out if you're concerned with data size. Note, however, that browsers tend to implement compression, so it is less needed there.

I developed it after the discussion in this thread: Encoding binary data within XML : alternatives to base64.
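
The 33% figure for Base64 follows from its design: every 3 input bytes become 4 output characters. A quick way to check it, as a sketch using the JDK's java.util.Base64 (Java 8+, an assumption on my part; the class and method names are made up):

```java
import java.util.Base64;

final class Base64OverheadDemo {

    // Percentage by which the Base64 output is longer than the raw input.
    static double overheadPercent(int rawLength) {
        byte[] raw = new byte[rawLength];
        int encodedLength = Base64.getEncoder().encode(raw).length;
        return 100.0 * (encodedLength - rawLength) / rawLength;
    }

    public static void main(String[] args) {
        // 3000 raw bytes encode to 4000 characters.
        System.out.println(overheadPercent(3000));
    }
}
```

Inputs whose length is not a multiple of 3 are padded up to the next group of 4 characters, so the overhead is slightly higher for them, and line breaks added by MIME-style encoders push it up a little further.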

4
votes

While the other answers are mostly fine, you could try another, more space-efficient encoding method like yEnc. With yEnc you also get checksum capability right "out of the box". See the link and abstract below. Of course, because XML does not have a native yEnc type, your XML schema should be updated to properly describe the encoded node.

Why: Because of how they work, Base64, uuencode, and similar encodings increase the amount of data (overhead) you need to store and transfer by roughly 33-40% (vs. yEnc's 1-2%). Depending on what you're encoding, that overhead could become an issue.

yEnc - Wikipedia abstract (https://en.wikipedia.org/wiki/YEnc): yEnc is a binary-to-text encoding scheme for transferring binary files in messages on Usenet or via e-mail. ... An additional advantage of yEnc over previous encoding methods, such as uuencode and Base64, is the inclusion of a CRC checksum to verify that the decoded file has been delivered intact.

2
votes

You can also uuencode your original binary data. This format is a bit older, but it does the same thing as Base64 encoding.

0
votes

If you have control over the XML format, you should turn the problem inside out. Rather than attaching the binary data to the XML, you should think about how to enclose a document that has multiple parts, one of which contains the XML.

The traditional solution to this is an archive (e.g. tar). But if you want to keep your enclosing document in a text-based format, or if you don't have access to a file archiving library, there is also a standardized scheme that is used heavily in email and HTTP: multipart/* MIME with Content-Transfer-Encoding: binary.

For example, if your servers communicate through HTTP and you want to send a multipart document whose primary part is an XML document that refers to binary data, the HTTP communication might look something like this:

POST / HTTP/1.1
Content-Type: multipart/related; boundary="qd43hdi34udh34id344"
... other headers elided ...

--qd43hdi34udh34id344
Content-Type: application/xml

<myxml>
    <data href="cid:data.bin"/>
</myxml>
--qd43hdi34udh34id344
Content-Id: <data.bin>
Content-type: application/octet-stream
Content-Transfer-Encoding: binary

... binary data ...
--qd43hdi34udh34id344--

As in the example above, the XML refers to the binary data in the enclosing multipart document by using the cid URI scheme, which identifies a part by its Content-Id header. The overhead of this scheme is just the MIME headers. A similar scheme can also be used for the HTTP response. Of course, with HTTP you also have the option of sending the parts of a multipart document in separate requests/responses.
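
Since the question is about Java, here is a minimal sketch of assembling such a body by hand. The class and method names are hypothetical; the boundary and headers are copied from the example above. A real sender would also have to make sure the boundary string never occurs inside the binary part.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

final class MultipartRelatedDemo {

    // Boundary taken from the example above; it must not appear in any part.
    static final String BOUNDARY = "qd43hdi34udh34id344";

    // Builds the multipart/related body: an XML part followed by a raw binary part.
    static byte[] buildBody(String xml, byte[] binary) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeAscii(out, "--" + BOUNDARY + "\r\n");
        writeAscii(out, "Content-Type: application/xml\r\n\r\n");
        byte[] xmlBytes = xml.getBytes(StandardCharsets.UTF_8);
        out.write(xmlBytes, 0, xmlBytes.length);
        writeAscii(out, "\r\n--" + BOUNDARY + "\r\n");
        writeAscii(out, "Content-Id: <data.bin>\r\n");
        writeAscii(out, "Content-Type: application/octet-stream\r\n");
        writeAscii(out, "Content-Transfer-Encoding: binary\r\n\r\n");
        out.write(binary, 0, binary.length); // raw bytes, no text encoding applied
        writeAscii(out, "\r\n--" + BOUNDARY + "--\r\n");
        return out.toByteArray();
    }

    private static void writeAscii(ByteArrayOutputStream out, String s) {
        byte[] b = s.getBytes(StandardCharsets.US_ASCII);
        out.write(b, 0, b.length);
    }

    public static void main(String[] args) {
        byte[] body = buildBody("<myxml><data href=\"cid:data.bin\"/></myxml>",
                new byte[] {1, 2, (byte) 0xFF});
        System.out.println(body.length + " bytes");
    }
}
```

The result would be sent as the request body with the Content-Type: multipart/related header shown in the example.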

If you want to avoid wrapping your data in a multipart document, an alternative is to use a data URI:

<myxml>
    <data href="data:application/something;charset=utf-8;base64,dGVzdGRhdGE="/>
</myxml>

But this has the base64 overhead.
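
On the receiving side, pulling the bytes back out of such an href value could look like the sketch below. It assumes Java 8+ and handles only the ";base64," form of data URIs; the class and method names are made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class DataUriDemo {

    // Decodes the payload of a Base64 data URI such as the href value above.
    static byte[] decode(String dataUri) {
        String marker = ";base64,";
        int i = dataUri.indexOf(marker);
        if (!dataUri.startsWith("data:") || i < 0) {
            throw new IllegalArgumentException("not a Base64 data URI");
        }
        return Base64.getDecoder().decode(dataUri.substring(i + marker.length()));
    }

    public static void main(String[] args) {
        byte[] bytes = decode("data:application/something;charset=utf-8;base64,dGVzdGRhdGE=");
        System.out.println(new String(bytes, StandardCharsets.UTF_8)); // prints "testdata"
    }
}
```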