Java DOM transforming and parsing arbitrary strings with invalid XML characters?

Question

First of all, this is not a duplicate of "How to parse invalid (bad / not well-formed) XML?": I don't have an invalid (or not well-formed) XML file, but rather an arbitrary Java String which may or may not contain invalid XML characters. I want to create a DOM Document containing a Text node with the given String, then transform it to a file. When the file is parsed back into a DOM Document, I want to get a String which is equal to the initial String. I create the Text node with org.w3c.dom.Document#createTextNode(String data) and I read the String back with org.w3c.dom.Node#getTextContent().

As you can see in https://stackoverflow.com/a/28152666/3882565, some characters are invalid in Text nodes of an XML file. Actually there are two different types of "invalid" characters for Text nodes. The first type are the characters ", &, ', < and >, which have predefined entities: the DOM API automatically escapes them as &quot;, &amp;, &apos;, &lt; and &gt; in the resulting file, and undoes this when the file is parsed. Now the problem is that this is not the case for other invalid characters such as '\u0000' or '\uffff'. An exception occurs when parsing the file because '\u0000' and '\uffff' are invalid characters.
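The round trip described above can be sketched as follows. This is only an illustration under a few assumptions: an in-memory String stands in for the file, the root element name "root" is made up, and the class and method names (DomRoundTrip, write, read) are hypothetical. It shows the case that works: characters such as < and & survive the trip because the serializer escapes them and the parser unescapes them.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper: not part of the question, just a sketch of its setup.
final class DomRoundTrip {

    /** Builds a document with one Text node and serializes it to a String. */
    static String write(String data) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.newDocument();
            Element root = doc.createElement("root"); // element name is an assumption
            root.appendChild(doc.createTextNode(data));
            doc.appendChild(root);

            StringWriter out = new StringWriter();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not serialize document", e);
        }
    }

    /** Parses the serialized form and returns the text content of the root element. */
    static String read(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String original = "a < b && c > d";
        String xml = write(original);
        System.out.println(xml);
        System.out.println(read(xml).equals(original));
    }
}
```

Passing a String that contains '\u0000' or '\uffff' through the same two methods is where the round trip is expected to break, as described in the question.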

Probably I have to implement a method which escapes those characters in the given String in a unique way before submitting it to the DOM API and undo that later when I get the String back, right? Is there a better way to do this? Did someone implement those or similar methods in the past?

Edit: This question was marked as a duplicate of "Best way to encode text data for XML in Java?". I have now read all of the answers, but none of them solves my problem. All of the answers suggest:

Using an XML library such as the DOM API, which I already do. None of those libraries actually replaces invalid characters other than ", &, ', <, > and a few more.
Replacing all invalid characters by "&#number;", which results in an exception when parsing the file for invalid characters such as "&#0;".
Using a third party library with an XML encode method. These methods do not support illegal characters such as '\u0000' (some libraries simply skip them).
Using a CDATA section, which doesn't support invalid characters either.

Why do you need any other characters escaped? Can you demonstrate that characters other than quotes, ampersands, less-than and greater-than are not coming back unaltered? — VGR
Ah, so you did. I know this isn’t much of an answer, but I would assume that the data is not really text data at all, so I would store it in a binary format like base64Binary. If you intend to transform the content itself, then you will indeed need to come up with some escape-like mechanism of representing those invalid characters. — VGR
This material has been covered thoroughly multiple times. Either strip the control characters et al. that are illegal, or encode them in some manner such as Base64. There are libraries available to help you. Sorry, but there's just nothing unique about your question. Additional duplicates added. There are many more. — kjhughes
If it must be XML, I'll toss out the idea to maybe use a CDATA section, perhaps with the encoded Base64. Seems like the closest fit from the XML spec. — markspace
@markspace a CDATA section alone wouldn't solve the problem at all. Invalid XML characters (Unicode code points which are not in the ranges #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]) are also invalid in a CDATA section. In combination with Base64 encoding it will work, true. Thanks for suggesting the CDATA section. — stonar96
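The character ranges quoted in the last comment translate directly into a check. A minimal sketch, assuming hypothetical names (XmlChars, isValidXmlChar, isValidXmlText); it walks the String by code point, so an unpaired surrogate is reported as invalid while a proper surrogate pair is accepted.

```java
// Hypothetical helper implementing the ranges quoted in the comment above:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
final class XmlChars {

    /** Returns true if the code point falls inside one of the quoted ranges. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns true if every code point of the String is valid. */
    static boolean isValidXmlText(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i); // a lone surrogate comes back as itself
            if (!isValidXmlChar(cp)) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidXmlText("plain text"));
        System.out.println(isValidXmlText("a\u0000b"));
    }
}
```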

Michael Kay · Accepted Answer · 2020-01-01T22:10:57

One technique is to encode the whole string as Base64-encoded-UTF8.

But if the "special" characters are rare, that's a significant sacrifice in readability and file size.
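A minimal sketch of the Base64 technique, using java.util.Base64 from the JDK. The class and method names (Base64Text, encode, decode) are made up for illustration. One assumption to be aware of: String#getBytes replaces unpaired surrogates with a replacement byte, so a String containing a lone surrogate would not round-trip through UTF-8 this way.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: encode before createTextNode, decode after getTextContent.
final class Base64Text {

    /** Encodes the String as UTF-8 bytes, then as Base64 text that is safe in XML. */
    static String encode(String s) {
        return Base64.getEncoder().encodeToString(s.getBytes(StandardCharsets.UTF_8));
    }

    /** Reverses encode: Base64 text back to UTF-8 bytes, then back to a String. */
    static String decode(String base64) {
        return new String(Base64.getDecoder().decode(base64), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "a\u0000b\uffff";
        String encoded = encode(original);
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(original));
    }
}
```

The encoded form contains only letters, digits, +, / and =, so it can be stored in a Text node without further escaping.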

Another technique is to represent special characters as processing instructions, for example <?U 0000?> for codepoint 0.
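One way this could look with the DOM API is sketched below. It is an assumption-laden illustration rather than a finished implementation: the target name U and the four-digit hex format follow the example above, while the class and method names (PiEscaping, newRoot, setText, getText) and the element name "root" are hypothetical. Valid runs of text become Text nodes, and each invalid code point becomes a ProcessingInstruction sibling.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.ProcessingInstruction;

// Hypothetical helper for the processing-instruction technique.
final class PiEscaping {
    static final String TARGET = "U"; // target name taken from the example above

    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Creates an empty document with a root element (name is an assumption). */
    static Element newRoot() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("root");
            doc.appendChild(root);
            return root;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Appends the String as Text nodes, with one PI per invalid code point. */
    static void setText(Element parent, String s) {
        Document doc = parent.getOwnerDocument();
        StringBuilder run = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (isValidXmlChar(cp)) {
                run.appendCodePoint(cp);
                continue;
            }
            if (run.length() > 0) { // flush the valid run collected so far
                parent.appendChild(doc.createTextNode(run.toString()));
                run.setLength(0);
            }
            parent.appendChild(doc.createProcessingInstruction(TARGET, String.format("%04X", cp)));
        }
        if (run.length() > 0) {
            parent.appendChild(doc.createTextNode(run.toString()));
        }
    }

    /** Rebuilds the original String from the Text and PI children. */
    static String getText(Element parent) {
        StringBuilder sb = new StringBuilder();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof ProcessingInstruction
                    && TARGET.equals(((ProcessingInstruction) n).getTarget())) {
                String hex = ((ProcessingInstruction) n).getData().trim();
                sb.appendCodePoint(Integer.parseInt(hex, 16));
            } else if (n.getNodeType() == Node.TEXT_NODE
                    || n.getNodeType() == Node.CDATA_SECTION_NODE) {
                sb.append(n.getNodeValue());
            }
        }
        return sb.toString();
    }
}
```

Note that Node#getTextContent() on the parent element ignores processing instructions, so the String has to be read back with a method like getText above. The sketch only exercises the in-memory DOM; serializing and re-parsing would need to be checked separately.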

Another would be to use backslash escaping, for example \u0000 for codepoint 0, and of course \\ for backslash itself. This has the advantage that you can probably find existing library routines that do this for you (for example JSON conversion libraries). I can't imagine why your requirements say you can't use such libraries; but if you really can't, then it's not hard to write the code yourself.
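Writing the backslash escaping by hand could look roughly like this. The class and method names (BackslashEscaping, escape, unescape) are hypothetical, and the choice to escape only invalid XML characters plus the backslash itself is one possible design, not the only one. Valid characters are left untouched, so the file stays readable when special characters are rare.

```java
// Hypothetical helper: escape before createTextNode, unescape after getTextContent.
final class BackslashEscaping {

    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Doubles backslashes and writes invalid characters as backslash, u, four hex digits. */
    static String escape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '\\') {
                sb.append("\\\\");
            } else if (isValidXmlChar(cp)) {
                sb.appendCodePoint(cp);
            } else {
                // Every invalid code point is below 0x10000, so four digits suffice.
                sb.append(String.format("\\u%04x", cp));
            }
        }
        return sb.toString();
    }

    /** Reverses escape; rejects input that escape could not have produced. */
    static String unescape(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c != '\\') {
                sb.append(c);
                continue;
            }
            if (i + 1 >= s.length()) {
                throw new IllegalArgumentException("dangling backslash at index " + i);
            }
            char next = s.charAt(++i);
            if (next == '\\') {
                sb.append('\\');
            } else if (next == 'u' && i + 4 < s.length()) {
                sb.append((char) Integer.parseInt(s.substring(i + 1, i + 5), 16));
                i += 4; // skip the four hex digits just consumed
            } else {
                throw new IllegalArgumentException("bad escape at index " + (i - 1));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String original = "path\\to\u0000file";
        String escaped = escape(original);
        System.out.println(escaped);
        System.out.println(unescape(escaped).equals(original));
    }
}
```

Because a literal backslash is always doubled, text that already looks like an escape sequence in the input stays distinguishable from a real escaped character after the round trip.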
