27
votes

It seems that all major browsers implement the DOMParser API, so XML can be parsed into a DOM and then queried using XPath, getElementsByTagName, etc.

However, detecting parsing errors seems to be trickier. DOMParser.prototype.parseFromString always returns a valid DOM. When a parsing error occurs, the returned DOM contains a <parsererror> element, but it's slightly different in each major browser.

Sample JavaScript:

// Note: the "otherr" prefix is deliberately unbound, which makes this document malformed
const xmlText = '<root xmlns="http://default" xmlns:other="http://other"><child><otherr:grandchild/></child></root>';
const parser = new DOMParser();
const dom = parser.parseFromString(xmlText, 'application/xml');
console.log((new XMLSerializer()).serializeToString(dom));

Result in Opera:

DOM's root is a <parsererror> element.

<?xml version="1.0"?><parsererror xmlns="http://www.mozilla.org/newlayout/xml/parsererror.xml">Error<sourcetext>Unknown source</sourcetext></parsererror>

Result in Firefox:

DOM's root is a <parsererror> element.

<?xml-stylesheet href="chrome://global/locale/intl.css" type="text/css"?>
<parsererror xmlns="http://www.mozilla.org/newlayout/xml/parsererror.xml">XML Parsing Error: prefix not bound to a namespace
Location: http://fiddle.jshell.net/_display/
Line Number 1, Column 64:<sourcetext>&lt;root xmlns="http://default" xmlns:other="http://other"&gt;&lt;child&gt;&lt;otherr:grandchild/&gt;&lt;/child&gt;&lt;/root&gt;
---------------------------------------------------------------^</sourcetext></parsererror>

Result in Safari:

The <root> element parses correctly but contains a nested <parsererror> in a different namespace than Opera and Firefox's <parsererror> element.

<root xmlns="http://default" xmlns:other="http://other"><parsererror xmlns="http://www.w3.org/1999/xhtml" style="display: block; white-space: pre; border: 2px solid #c77; padding: 0 1em 0 1em; margin: 1em; background-color: #fdd; color: black"><h3>This page contains the following errors:</h3><div style="font-family:monospace;font-size:12px">error on line 1 at column 50: Namespace prefix otherr on grandchild is not defined
</div><h3>Below is a rendering of the page up to the first error.</h3></parsererror><child><otherr:grandchild/></child></root>

Am I missing a simple, cross-browser way of detecting if a parsing error occurred anywhere in the XML document? Or must I query the DOM for each of the possible <parsererror> elements that different browsers might generate?
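
For reference, the brute-force approach from the last sentence might look like the sketch below. The two namespace URIs are copied from the serialized output above; the helper name hasKnownParserError and the constant are made up for illustration, and other engines or versions may use namespaces that are not listed. It also cannot tell a browser-generated <parsererror> from one that legitimately appears in a valid document using the same namespace.

```javascript
// Namespace URIs taken from the browser output shown above.
// Assumption: this list is not exhaustive.
const KNOWN_PARSERERROR_NAMESPACES = [
    'http://www.mozilla.org/newlayout/xml/parsererror.xml', // Opera, Firefox
    'http://www.w3.org/1999/xhtml'                          // Safari
];

// Hypothetical helper: returns true if `doc` contains a <parsererror>
// element in any of the namespaces listed above.
function hasKnownParserError(doc) {
    return KNOWN_PARSERERROR_NAMESPACES.some(function (ns) {
        return doc.getElementsByTagNameNS(ns, 'parsererror').length > 0;
    });
}
```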

3
Can you just call .getElementsByTagName("parsererror") on the root DOM node and assume that there was an error if the length of the returned node list is greater than zero? - Pointy
Technically the XML document I'm parsing could contain <parsererror> elements but still be totally valid XML (the elements would be from a different namespace). So I'd have to make multiple calls to .getElementsByTagNameNS(namespace, 'parsererror') for the namespace URIs from each browser. - cspotcode
Hmm. Well the HTML5 spec for this is fragmentary, to say the least. - Pointy
I noticed this Mozilla bug, which points to this WHATWG spec. I think it is dumb for browsers to not use exceptions: as you wrote, we might have to parse XML documents that look like what is returned for an error, and be unable to tell whether parsing worked or not. The only way to fully solve the problem is to write another parser. - Damien
It looks like Chrome does the same thing as Safari. - Tom Winter

3 Answers

21
votes

This is the best solution I've come up with.

I attempt to parse a string that is intentionally invalid XML and observe the namespace of the resulting <parsererror> element. Then, when parsing actual XML, I can use getElementsByTagNameNS to detect the same kind of <parsererror> element and throw a JavaScript Error.

// My function that parses a string into an XML DOM, throwing an Error if XML parsing fails
function parseXml(xmlString) {
    var parser = new DOMParser();
    // attempt to parse the passed-in xml
    var dom = parser.parseFromString(xmlString, 'application/xml');
    if(isParseError(dom)) {
        throw new Error('Error parsing XML');
    }
    return dom;
}

function isParseError(parsedDocument) {
    // parser and parsererrorNS could be cached on startup for efficiency
    var parser = new DOMParser(),
        erroneousParse = parser.parseFromString('<', 'application/xml'),
        parsererrorNS = erroneousParse.getElementsByTagName("parsererror")[0].namespaceURI;

    if (parsererrorNS === 'http://www.w3.org/1999/xhtml') {
        // In PhantomJS the parsererror element doesn't seem to have a special namespace, so we are just guessing here :(
        return parsedDocument.getElementsByTagName("parsererror").length > 0;
    }

    return parsedDocument.getElementsByTagNameNS(parsererrorNS, 'parsererror').length > 0;
}
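
The caching mentioned in the comment could be sketched as follows. This is only one way to do it: createParseErrorChecker is a hypothetical name, and the optional ParserCtor argument exists purely so the sketch can be exercised outside a browser (by default it uses the global DOMParser). The null fallback for a probe that yields no <parsererror> is also my assumption, not part of the code above.

```javascript
// Hypothetical factory: probes the parser once, then returns a checker
// that reuses the detected namespace on every call.
function createParseErrorChecker(ParserCtor) {
    var Ctor = ParserCtor || DOMParser;
    var probe = new Ctor().parseFromString('<', 'application/xml');
    var errorElement = probe.getElementsByTagName('parsererror')[0];
    // Assumption: if the probe produced no <parsererror>, use null and
    // fall back to a name-only lookup below.
    var parsererrorNS = errorElement ? errorElement.namespaceURI : null;

    return function isParseError(parsedDocument) {
        if (parsererrorNS === null ||
            parsererrorNS === 'http://www.w3.org/1999/xhtml') {
            // No distinctive namespace to rely on, so this is a guess
            return parsedDocument.getElementsByTagName('parsererror').length > 0;
        }
        return parsedDocument
            .getElementsByTagNameNS(parsererrorNS, 'parsererror').length > 0;
    };
}
```

In a browser, `var isParseError = createParseErrorChecker();` would be called once at startup, and the returned function used in place of the uncached version.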

Note that this solution doesn't include the special-casing needed for Internet Explorer. However, things are much more straightforward in IE. XML is parsed with a loadXML method which returns true or false if parsing succeeded or failed, respectively. See http://www.w3schools.com/xml/xml_parser.asp for an example.
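
A rough sketch of that IE-only path, assuming the MSXML ActiveX document ("Microsoft.XMLDOM") that the linked page uses. parseXmlLegacyIE and the createDocument parameter are hypothetical names; the parameter is there only so the function can be tried without IE. As far as I know, the parseError.reason property holds the error description, but treat that as an assumption to verify.

```javascript
// Hypothetical IE-only counterpart to parseXml. `createDocument` is an
// optional factory used for testing; in IE it can be omitted.
function parseXmlLegacyIE(xmlString, createDocument) {
    var xmlDoc = createDocument
        ? createDocument()
        : new ActiveXObject('Microsoft.XMLDOM');
    xmlDoc.async = false;
    // loadXML returns true on success and false on a parse failure
    if (!xmlDoc.loadXML(xmlString)) {
        throw new Error('Error parsing XML: ' + xmlDoc.parseError.reason);
    }
    return xmlDoc;
}
```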

16
votes

When I first came here, I upvoted the original answer (by cspotcode); however, it does not work in Firefox. The resulting namespace is always null because of the structure of the produced document. I did a little research (check the code here). The idea is to use not

invalidXml.childNodes[0].namespaceURI

but

invalidXml.getElementsByTagName("parsererror")[0].namespaceURI

And then select the "parsererror" element by namespace, as in the original answer. However, if you have a valid XML document with a <parsererror> tag in the same namespace as the one used by the browser, you end up with a false alarm. So, here's a heuristic to check whether your XML parsed successfully:

function tryParseXML(xmlString) {
    var parser = new DOMParser();
    var parsererrorNS = parser.parseFromString('INVALID', 'application/xml').getElementsByTagName("parsererror")[0].namespaceURI;
    var dom = parser.parseFromString(xmlString, 'application/xml');
    if(dom.getElementsByTagNameNS(parsererrorNS, 'parsererror').length > 0) {
        throw new Error('Error parsing XML');
    }
    return dom;
}

Why not implement exceptions in DOMParser?

An interesting thing worth mentioning in this context: if you fetch an XML file with XMLHttpRequest, the parsed DOM is stored in the responseXML property, or that property is null if the file's content was invalid. Not an exception, not a parsererror or another specific indicator. Just null.
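
To make that concrete, here is a minimal sketch of turning the null into an exception. getResponseXmlOrThrow is a hypothetical helper name, and the file name in the usage comment is a placeholder. Note that responseXML can also be null for reasons other than malformed XML (for example a non-XML content type), so the error message is kept generic.

```javascript
// Hypothetical helper: call this once the request has finished loading.
// `xhr` is an XMLHttpRequest (or any object exposing `responseXML`).
function getResponseXmlOrThrow(xhr) {
    var doc = xhr.responseXML;
    if (doc === null) {
        // null covers malformed XML, but also non-XML responses
        throw new Error('No XML document available in responseXML');
    }
    return doc;
}

// Usage in a browser ('data.xml' is a placeholder):
// var xhr = new XMLHttpRequest();
// xhr.open('GET', 'data.xml');
// xhr.onload = function () { var dom = getResponseXmlOrThrow(xhr); };
// xhr.send();
```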

0
votes

In current browsers, the DOMParser appears to have two possible behaviours when given malformed XML:

  1. Discard the resulting document entirely — return a <parsererror> document with error details. Firefox and Edge seem to always take this approach; browsers from the Chrome family do this in most cases.

  2. Return the resulting document with one extra <parsererror> inserted as the root element's first child. Chrome's parser does this in cases where it's able to produce a root element despite finding errors in the source XML. The inserted <parsererror> may or may not have a namespace. The rest of the document seems to be left intact, including comments, etc. Refer to xml_errors.cc — search for XMLErrors::InsertErrorMessageBlock.

For (1), the way to detect an error is to add a node to the source string, parse it, check whether the node exists in the resulting document, then remove it. As far as I'm aware, the only way to achieve this without potentially affecting the result is to append a processing instruction or comment to the end of the source.

Example:

let key = `a`+Math.random().toString(32);

let doc = (new DOMParser).parseFromString(src+`<?${key}?>`, `application/xml`);

let lastNode = doc.lastChild;
if (!(lastNode instanceof ProcessingInstruction)
    || lastNode.target !== key
    || lastNode.data !== ``)
{
    /* the XML was malformed */
} else {
    /* the XML was well-formed */
    doc.removeChild(lastNode);
}

If case (2) occurs, the error won't be detected by the above technique, so another step is required.

We can leverage the fact that only one <parsererror> is inserted, even if multiple errors are found in different places within the source. By parsing the source string again, this time with a syntax error appended, we can ensure the (2) behaviour is triggered, then check whether the number of <parsererror> elements has changed. If it has not, the first parseFromString result already contained a true <parsererror>.

Example:

let errCount = doc.documentElement.getElementsByTagName(`parsererror`).length;
if (errCount !== 0) {
    let doc2 = (new DOMParser).parseFromString(src+`<?`, `application/xml`);
    if (doc2.documentElement.getElementsByTagName(`parsererror`).length === errCount) {
        /* the XML was malformed */
    }
}

I put together a test page to verify this approach: https://github.com/Cauterite/domparser-tests.

It tests against the entire XML W3C Conformance Test Suite, plus a few extra samples to ensure it can distinguish documents containing <parsererror> elements from actual errors emitted by the DOMParser. Only a handful of test cases are excluded because they contain invalid Unicode sequences.

To be clear, it is only testing whether the result is identical to XMLHttpRequest.responseXML for a given document.

You can run the tests yourself at https://cauterite.github.io/domparser-tests/index.html, but note that it uses ECMAScript 2018.

At time of writing, all tests pass in recent versions of Firefox, Chrome, Safari and Firefox on Android. Edge and Presto-based Opera should pass since their DOMParsers appear to behave like Firefox's, and current Opera should pass since it's a fork of Chromium.


Please let me know if you can find any counter-examples or possible improvements.

For the lazy, here's the complete function:

const tryParseXml = function(src) {
    /* returns an XMLDocument, or null if `src` is malformed */

    let key = `a`+Math.random().toString(32);

    let parser = new DOMParser;

    let doc = null;
    try {
        doc = parser.parseFromString(
            src+`<?${key}?>`, `application/xml`);
    } catch (_) {}

    if (!(doc instanceof XMLDocument)) {
        return null;
    }

    let lastNode = doc.lastChild;
    if (!(lastNode instanceof ProcessingInstruction)
        || lastNode.target !== key
        || lastNode.data !== ``)
    {
        return null;
    }

    doc.removeChild(lastNode);

    let errElemCount =
        doc.documentElement.getElementsByTagName(`parsererror`).length;
    if (errElemCount !== 0) {
        let errDoc = null;
        try {
            errDoc = parser.parseFromString(
                src+`<?`, `application/xml`);
        } catch (_) {}

        if (!(errDoc instanceof XMLDocument)
            || errDoc.documentElement.getElementsByTagName(`parsererror`).length
                === errElemCount)
        {
            return null;
        }
    }

    return doc;
}
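
If you prefer an exception over a null return, as in the other answers, a thin wrapper could look like this. It is only a sketch: parseXmlOrThrow is a hypothetical name, and the parse function is passed in as an argument so the wrapper does not depend on a particular implementation.

```javascript
// Hypothetical wrapper: `tryParse` is expected to behave like tryParseXml
// above, i.e. return a document on success and null on malformed input.
function parseXmlOrThrow(src, tryParse) {
    const doc = tryParse(src);
    if (doc === null) {
        throw new Error('Error parsing XML');
    }
    return doc;
}
```

In a browser this would be called as `parseXmlOrThrow(someXmlString, tryParseXml)`.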