I'm having trouble retrieving unparsed entity URIs with the XSLT function unparsed-entity-uri().

I'm using a SAXTransformerFactory, as in the "Efficient XSLT pipeline in Java" question, because I need to run a chain of transformations (i.e. apply several XSLT transformations, using the result of each one as the input of the next).

I found that I cannot retrieve the unparsed entity URI with the code below. It works with Xalan, but not with Saxon-HE (version 9.7.0), and I need Saxon because I would rather use XSLT 2.0 (nothing in the code below is specific to XSLT 2.0; it is only a minimal example). It also works with Saxon if I don't use a TransformerHandler: for example, stf.newTransformer(new StreamSource("transfo.xsl")).transform(new StreamSource("input.xml"), new StreamResult(System.out)) produces the desired output.

Is there a configuration step that I forgot?

    // use "org.apache.xalan.processor.TransformerFactoryImpl" for Xalan
    String transformerFactoryClassName = "net.sf.saxon.TransformerFactoryImpl";
    SAXTransformerFactory stf = (SAXTransformerFactory) TransformerFactory.newInstance(transformerFactoryClassName,
            LaunchSimpleTransformationUnparsedEntities.class.getClassLoader());
    try {
        TransformerHandler thTransf = stf
                .newTransformerHandler(new StreamSource("transfo.xsl"));

        // output the result in console
        thTransf.setResult(new StreamResult(System.out));

        // Launch transformation of input.xml
        Transformer t = stf.newTransformer();
        t.transform(new StreamSource("input.xml"),
                new SAXResult(thTransf));

    } catch (TransformerConfigurationException e) {
        e.printStackTrace();
    } catch (TransformerException e) {
        e.printStackTrace();
    }

The input document (input.xml) is:

<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE book
[<!ENTITY cover_hadrien SYSTEM "images/covers/cover_hadrien.jpg" NDATA jpeg>]>
<book>
  <title>Les mémoires d'Hadrien</title>
  <author>Marguerite Yourcenar</author>
  <cover imgref="cover_hadrien" />
</book>

and a sample XSLT (for transfo.xsl):

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

    <xsl:template match="cover">
      <xsl:copy>
        <xsl:value-of select="unparsed-entity-uri(@imgref)"/>
      </xsl:copy>
    </xsl:template>

    <xsl:template match="@*|node()">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

As a result, I would expect something like:

<?xml version="1.0" encoding="UTF-8"?><book>
  <title>Les mémoires d'Hadrien</title>
  <author>Marguerite Yourcenar</author>
  <cover>images/covers/cover_hadrien.jpg</cover>
</book>

However, <cover> is empty when the transformation is run with Saxon.

2 Answers

1
votes

Interesting observation. The issue in fact is not with Saxon's TransformerHandler, but rather with the "identity transformer" obtained using SAXTransformerFactory.newTransformer(): the identity transformer is not passing unparsed entities down the line. This is essentially because Saxon's identity transformer is reusing parts of the XSLT engine, and XSLT does not provide any way for a transformation to output unparsed entities in the result. If you sent the SAX parser output directly to the TransformerHandler, rather than going via an identity transformer, then I think it would all work.

As with all things JAXP-related, the specification of SAXTransformerFactory.newTransformer() is infuriatingly vague. All it says is that the returned Transformer performs a copy of the Source to the Result, i.e. the "identity transform". What exactly counts as a copy? I think Saxon's interpretation has been that it is equivalent to the effect of doing an XSLT identity transform, which would lose unparsed entities (as well as other things like CDATA sections, the DTD, etc.).

Incidentally, XSLT 2.0 specifies that the result of unparsed-entity-uri() should be an absolute URI (XSLT 1.0 doesn't say anything on the subject), so even if this is fixed, the Saxon output will differ from the expected output shown in the question.
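To illustrate what "absolute URI" means here, the sketch below resolves the relative system identifier from the ENTITY declaration against a document base URI using plain java.net.URI. The base URI http://example.org/books/input.xml and the class and method names are hypothetical, chosen only for the example; this shows ordinary URI resolution, not Saxon's internal code.

```java
import java.net.URI;

public class UnparsedEntityUriSketch {

    // Resolve the system identifier of an unparsed entity against the
    // base URI of the document that declares it.
    static String resolveSystemId(String baseUri, String systemId) {
        return URI.create(baseUri).resolve(systemId).toString();
    }

    public static void main(String[] args) {
        // Hypothetical location of input.xml (an assumption for this sketch).
        String base = "http://example.org/books/input.xml";
        // Prints http://example.org/books/images/covers/cover_hadrien.jpg
        System.out.println(resolveSystemId(base, "images/covers/cover_hadrien.jpg"));
    }
}
```

With a file-based input, the result would instead be a file: URI that depends on where input.xml lives, so output such as <cover>images/covers/cover_hadrien.jpg</cover> should not be expected from an XSLT 2.0 processor.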

Entered as a Saxon issue here: https://saxonica.plan.io/issues/3201

I think we need to be a bit careful about passing unparsed entities to a SAXResult if we don't pass all the other events expected by a SAX DTDHandler, and we're certainly not going to change the Saxon identity transformer to retain things (like DTD declarations) that aren't modelled in XDM.
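For readers unfamiliar with how unparsed entities travel through SAX, here is a minimal sketch that uses only the JDK's built-in parser (no Saxon or Xalan). It registers a DTDHandler and records each unparsedEntityDecl event, which is the event a downstream TransformerHandler has to receive for unparsed-entity-uri() to find anything. The class and method names are hypothetical.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.DTDHandler;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

public class DtdEventsSketch {

    // Parses the XML text and returns a map of entity name -> system
    // identifier for every unparsed entity the parser reports.
    static Map<String, String> unparsedEntities(String xml) throws Exception {
        Map<String, String> entities = new LinkedHashMap<>();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setDTDHandler(new DTDHandler() {
            @Override
            public void notationDecl(String name, String publicId, String systemId) {
                // Notation declarations are not needed for this sketch.
            }

            @Override
            public void unparsedEntityDecl(String name, String publicId,
                                           String systemId, String notationName) {
                entities.put(name, systemId);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return entities;
    }

    public static void main(String[] args) throws Exception {
        // Same declaration as in the question's input.xml, inlined as a string.
        String xml = "<!DOCTYPE book [<!ENTITY cover_hadrien SYSTEM "
                + "\"images/covers/cover_hadrien.jpg\" NDATA jpeg>]>"
                + "<book><cover imgref=\"cover_hadrien\"/></book>";
        unparsedEntities(xml).forEach((name, uri) -> System.out.println(name + " -> " + uri));
    }
}
```

The parser may report the system identifier already resolved to an absolute URI, so the printed value can carry a prefix that depends on the working directory. If no component in the pipeline forwards this event, the stylesheet has no entity table to consult.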

0
votes

Indeed, following @MichaelKay's explanation, launching the transformation this way works properly:

        // launch transformation of input.xml
        XMLReader reader = XMLReaderFactory.createXMLReader();
        reader.setContentHandler(thTransf);
        reader.setDTDHandler(thTransf);
        reader.parse(new InputSource("input.xml"));

This replaces the following lines, which were used initially:

        // Launch transformation of input.xml
        Transformer t = stf.newTransformer();
        t.transform(new StreamSource("input.xml"),
                new SAXResult(thTransf));