My goal is to split a single large XML file (about 2 to 15 GB) containing mixed content into multiple XML files, each containing one entity type, which can later be imported into an SQL database, for example. I'm currently using Saxon-EE version 9.5.1.2J, but any other XSLT processor would be fine if it does the job fast and reliably.
Here is what I already figured out:
- Saxon seems to be the de facto standard processor for XSLT 3.0, while Raptor XML server seems to be another (more expensive) choice. Other XSLT processors usually support only XSLT 1.0.
- Large files can be processed using XSLT 3.0 streaming, so that the whole file does not have to fit into memory. Note: this feature is available in Saxon Enterprise Edition only.
- You can use <xsl:result-document> to write output to a different file, but you cannot use it multiple times in the same stylesheet to write to the same file (apparently it is not thread-safe).
- <xsl:for-each-group> with group-by is obviously not streamable.
- <xsl:stream> can only contain one <xsl:iterate> block, which is fine. But: inside that iterate block, you can only access attributes of the current node and one child node (even <xsl:for-each> only works on that one node). If you try to access the value of a second node, you get the error "SXST0060: More than one subexpression consumes the input stream". See the sketch after this list for a single-consumer extraction that does work.
- <xsl:apply-templates> inside <xsl:stream> (instead of iterate) requires a streamable mode (as shown below). However, the stream can only be consumed once, just as with iterate; otherwise you again get the error "SXST0060: More than one subexpression consumes the input stream".
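To illustrate, here is a minimal sketch of the kind of single-consumer extraction that does work for me. The content/articles/article structure matches the stylesheet shown further below; the output file name articles.xml is just made up. Adding a second consuming expression inside the same <xsl:stream> immediately triggers SXST0060:

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:result-document href="articles.xml">
      <articles>
        <xsl:stream href="input.xml">
          <!-- exactly one consuming expression per stream:
               copy each article as it streams past -->
          <xsl:iterate select="content/articles/article">
            <xsl:copy-of select="."/>
          </xsl:iterate>
        </xsl:stream>
      </articles>
    </xsl:result-document>
  </xsl:template>
</xsl:stylesheet>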
My conclusion is that currently available XSLT processors require the use of multiple <xsl:stream> tags to write to different files, which in practice means scanning the large input file once per output file. This holds even when writing different entities to the same output file as a workaround, since it is not possible to "consume" the same input stream more than once:
<xsl:mode name="s" streamable="yes"/>

<xsl:template match="/">
  <!-- first pass over the input: extract the articles -->
  <xsl:stream href="input.xml">
    <xsl:apply-templates mode="s" select="content/articles"/>
  </xsl:stream>
  <!-- second pass over the same input: extract the authors -->
  <xsl:stream href="input.xml">
    <xsl:apply-templates mode="s" select="content/articles/article/authors"/>
  </xsl:stream>
</xsl:template>
At some point, extracting the different entities from a large XML file with an interpreted, much more complex command-line script becomes faster, which makes XSLT look slow and useless in comparison :(
Is there an XSLT 3.0-based solution out there that works as expected, without scanning the input file multiple times? I don't see any fundamental technical limitation of XSLT that would prevent such a use case.
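For what it's worth, the XSLT 3.0 draft defines <xsl:fork> for apparently exactly this scenario: several consuming branches evaluated in a single pass over the stream. As far as I can tell, Saxon 9.5 does not implement it yet, so the following is only a sketch of what I would hope to write one day (same assumed input structure as above, made-up output file names):

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:stream href="input.xml">
      <!-- as I understand the draft, each xsl:sequence branch may consume
           the stream, and the processor merges all branches into one pass -->
      <xsl:fork>
        <xsl:sequence>
          <xsl:result-document href="articles.xml">
            <articles>
              <xsl:copy-of select="content/articles/article"/>
            </articles>
          </xsl:result-document>
        </xsl:sequence>
        <xsl:sequence>
          <xsl:result-document href="authors.xml">
            <authors>
              <xsl:copy-of select="content/articles/article/authors"/>
            </authors>
          </xsl:result-document>
        </xsl:sequence>
      </xsl:fork>
    </xsl:stream>
  </xsl:template>
</xsl:stylesheet>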