My goal is to split a single large XML file (about 2 to 15 GB) containing mixed content into multiple XML files, each containing one entity type, which can later be imported into an SQL database, for example. I'm currently using Saxon-EE version 9.5.1.2J, but any other XSLT processor would be fine if it does the job fast and reliably.
Here is what I already figured out:
- Saxon seems to be the de facto standard processor for XSLT 3.0, while Raptor XML server seems to be another (more expensive) choice. Other XSLT processors usually support only XSLT 1.0.
- Large files can be processed using XSLT 3.0 streaming, so that the whole file does not have to fit into memory. Note: this feature is available in Saxon Enterprise Edition only.
- You can use <xsl:result-document> to write output to a different file, but you cannot use it multiple times in the same stylesheet to write to the same file (apparently it is not thread-safe).
- <xsl:for-each-group> with group-by is obviously not streamable.
- <xsl:stream> can only contain one <xsl:iterate> block, which is fine. But: inside that iterate block, you can only access attributes of the current node and one child node (even <xsl:for-each> only works on that one node). If you try to access the value of a second node, you get the error "SXST0060: More than one subexpression consumes the input stream". See the sketch after this list for a single-consumer extraction that does work.
- <xsl:apply-templates> inside <xsl:stream> (instead of iterate) requires a streamable mode (as shown below). However, the stream can only be consumed once, just as with iterate; otherwise you again get the error "SXST0060: More than one subexpression consumes the input stream".
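To illustrate, here is a minimal sketch of the kind of single-consumer extraction that does work for me. The content/articles/article structure matches the stylesheet shown further below; the output file name articles.xml is just made up. Adding a second consuming expression inside the same <xsl:stream> immediately triggers SXST0060:

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:result-document href="articles.xml">
      <articles>
        <xsl:stream href="input.xml">
          <!-- exactly one consuming expression per stream:
               copy each article as it streams past -->
          <xsl:iterate select="content/articles/article">
            <xsl:copy-of select="."/>
          </xsl:iterate>
        </xsl:stream>
      </articles>
    </xsl:result-document>
  </xsl:template>
</xsl:stylesheet>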
My conclusion is that currently available XSLT processors require the use of multiple <xsl:stream> tags to write to different files, which in practice means scanning the large input file once per output file. This holds even when writing different entities to the same output file as a workaround, since it is not possible to "consume" the same input stream more than once:
<xsl:mode name="s" streamable="yes"/>

<xsl:template match="/">
  <!-- first pass over the input: extract the articles -->
  <xsl:stream href="input.xml">
    <xsl:apply-templates mode="s" select="content/articles"/>
  </xsl:stream>
  <!-- second pass over the same input: extract the authors -->
  <xsl:stream href="input.xml">
    <xsl:apply-templates mode="s" select="content/articles/article/authors"/>
  </xsl:stream>
</xsl:template>
At some point, extracting the different entities from a large XML file with an interpreted, much more complex command-line script becomes faster, which makes XSLT look slow and useless in comparison :(
Is there an XSLT 3.0-based solution out there that works as expected, without scanning the input file multiple times? I don't see any fundamental technical limitation of XSLT that would prevent such a use case.
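For what it's worth, the XSLT 3.0 draft defines <xsl:fork> for apparently exactly this scenario: several consuming branches evaluated in a single pass over the stream. As far as I can tell, Saxon 9.5 does not implement it yet, so the following is only a sketch of what I would hope to write one day (same assumed input structure as above, made-up output file names):

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/">
    <xsl:stream href="input.xml">
      <!-- as I understand the draft, each xsl:sequence branch may consume
           the stream, and the processor merges all branches into one pass -->
      <xsl:fork>
        <xsl:sequence>
          <xsl:result-document href="articles.xml">
            <articles>
              <xsl:copy-of select="content/articles/article"/>
            </articles>
          </xsl:result-document>
        </xsl:sequence>
        <xsl:sequence>
          <xsl:result-document href="authors.xml">
            <authors>
              <xsl:copy-of select="content/articles/article/authors"/>
            </authors>
          </xsl:result-document>
        </xsl:sequence>
      </xsl:fork>
    </xsl:stream>
  </xsl:template>
</xsl:stylesheet>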