2
votes

First things first: I cannot change the XML output. It is produced by a third party, and they are inserting invalid characters into the XML. I am given an InputStream of the byte-stream representation of the XML. Is there a cleaner way to filter out the offending characters besides consuming the stream into a String and processing it? I found a suggestion to use a FilterReader, but that doesn't work for me as I have a byte stream and not a character stream.

For what it's worth, this is all part of a JAXB unmarshalling procedure, in case that offers other options.

We aren't willing to toss the whole stream if it has bad characters. We have decided to remove them and carry on.

Here is a FilterReader I tried to build.

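import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Log and LogFactory are assumed to come from Apache Commons Logging.
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

public class InvalidXMLCharacterFilterReader extends FilterReader {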

    private static final Log LOG = LogFactory
    .getLog(InvalidXMLCharacterFilterReader.class);

    public InvalidXMLCharacterFilterReader(Reader in) {
        super(in);
    }

    public int read() throws IOException {
        // Delegate to the array version so single-character reads are filtered too.
        char[] buf = new char[1];
        int result = read(buf, 0, 1);
        if (result == -1) {
            return -1;
        }
        return (int) buf[0];
    }

    public int read(char[] buf, int from, int len) throws IOException {
        int count = 0;
        while (count == 0) {
            count = in.read(buf, from, len);
            if (count == -1)
                return -1;

            int last = from;
            for (int i = from; i < from + count; i++) {
                LOG.debug("" + (char)buf[i]);
                if(!isBadXMLChar(buf[i])) {
                    buf[last++] = buf[i];
                }
            }

            count = last - from;
        }
        return count;
    }

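    private boolean isBadXMLChar(char c) {
        // A Java char is a 16-bit UTF-16 code unit, so it can never be in the
        // range 0x10000-0x10FFFF. Supplementary characters arrive as surrogate
        // pairs, which are passed through here instead of being stripped
        // (unpaired surrogates are not detected).
        if ((c == 0x9) ||
            (c == 0xA) ||
            (c == 0xD) ||
            ((c >= 0x20) && (c <= 0xD7FF)) ||
            ((c >= 0xE000) && (c <= 0xFFFD)) ||
            Character.isSurrogate(c)) {
            return false;
        }
        return true;
    }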

}

And here is how I am unmarshalling it:

jaxbContext = JAXBContext.newInstance(MyObj.class);
Unmarshaller unMarshaller = jaxbContext.createUnmarshaller();
Reader r = new InvalidXMLCharacterFilterReader(new BufferedReader(new InputStreamReader(is, "UTF-8")));
MyObj obj = (MyObj) unMarshaller.unmarshal(r);

And here is some example bad XML:

<?xml version="1.0" encoding="UTF-8" ?>
<foo>
    bar&#x01;
</foo>
Are you sure that they are inserting invalid characters? Isn't it you who is reading the characters from the binary stream using the wrong encoding and/or displaying the read characters using the wrong encoding? - BalusC
You should check BalusC's comment. If you still want to proceed with a FilterReader implementation, then you have no problem transforming the byte stream into a reader (using InputStreamReader), provided that you KNOW the text encoding of the byte stream. - Eyal Schneider
I don't know what BalusC is getting at. They are blatantly invalid XML 1.0 characters. I tried using an InputStreamReader (as well as wrapping that in a buffered reader) with no luck. I'll update my question with code. - DanInDC
Can you give us an example of the invalid characters you are getting? - Roland Illig
What is the current bad behaviour? Do you know your filter is failing to remove the bad characters, or are you getting an error in the unmarshalling that might be from another cause? - Don Roby

1 Answer

1
votes

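In order to do this with a filter, the filter needs to be aware of XML character references, because (at least in your example, and likely sometimes in actual use) the bad characters appear in the XML as numeric character references such as &#x01; rather than as raw characters.

The filter sees that reference as a sequence of six perfectly acceptable characters (&, #, x, 0, 1 and ;), so it does not strip them.

The conversion that breaks JAXB happens later in the process: the XML parser expands the reference into the character U+0001, which is not allowed in XML 1.0, and rejects the document.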
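
As a rough illustration of what "reference aware" could look like, here is a minimal sketch of a reader that drops both raw invalid characters and numeric character references that resolve to invalid characters. The class name CharRefAwareFilterReader, the helper names, and the 12-character lookahead limit are all hypothetical choices made for this example, not part of the question's code or of any library. It assumes XML 1.0 character rules, does not look inside CDATA sections or comments, and reads one character at a time, so the underlying reader should be buffered.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: strips invalid raw characters and invalid numeric character references. */
class CharRefAwareFilterReader extends FilterReader {

    // Longest reference this sketch recognises, e.g. "&#x0010FFFF;" (12 characters).
    // Longer references (many leading zeros) pass through untouched.
    private static final int MAX_REF_LENGTH = 12;

    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:x([0-9a-fA-F]+)|([0-9]+));");

    // Characters already taken from the underlying reader but not yet returned.
    private final StringBuilder pending = new StringBuilder();

    CharRefAwareFilterReader(Reader in) {
        super(in);
    }

    // XML 1.0 "Char" production, expressed on code points.
    static boolean isValidXmlCodePoint(long cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Next unfiltered character: pushed-back characters first, then the stream.
    private int nextRaw() throws IOException {
        if (pending.length() > 0) {
            char c = pending.charAt(0);
            pending.deleteCharAt(0);
            return c;
        }
        return in.read();
    }

    // Collects "&" plus following characters up to ';' or the lookahead limit.
    private String readCandidate() throws IOException {
        StringBuilder ref = new StringBuilder("&");
        while (ref.length() < MAX_REF_LENGTH) {
            int next = nextRaw();
            if (next == -1) break;
            ref.append((char) next);
            if (next == ';') break;
        }
        return ref.toString();
    }

    @Override
    public int read() throws IOException {
        while (true) {
            int c = nextRaw();
            if (c == -1) return -1;
            if (c == '&') {
                String candidate = readCandidate();
                Matcher m = CHAR_REF.matcher(candidate);
                if (m.matches()) {
                    long cp = m.group(1) != null
                            ? Long.parseLong(m.group(1), 16)
                            : Long.parseLong(m.group(2), 10);
                    if (!isValidXmlCodePoint(cp)) continue; // drop the whole reference
                }
                // Not a bad reference: emit '&' now, re-read the rest later.
                pending.insert(0, candidate, 1, candidate.length());
                return '&';
            }
            // Surrogate halves are passed through; other invalid characters are skipped.
            if (Character.isSurrogate((char) c) || isValidXmlCodePoint(c)) return c;
        }
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        int count = 0;
        while (count < len) {
            int c = read();
            if (c == -1) break;
            buf[off + count++] = (char) c;
        }
        return count == 0 ? -1 : count;
    }

    @Override
    public boolean markSupported() {
        return false; // the pushback buffer is not covered by mark/reset
    }
}
```

Under these assumptions it would slot into the unmarshalling code from the question in place of InvalidXMLCharacterFilterReader, e.g. new CharRefAwareFilterReader(new BufferedReader(new InputStreamReader(is, "UTF-8"))). Valid references such as &amp; or &#x41; are left for the parser to expand as usual.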