I'm new to XSLT. I need to aggregate some information about the contents of PDF files, which I get as XML from pdf2txt.py. Some of the PDFs are large (over 100 MB), and their XML output is even larger. It therefore seems more time-efficient to process everything in memory, piping the output through several xsltproc commands to prune unneeded content from the XML. Among other things, there is an XML node whose text content I would like to convert into an attribute of its parent node.
More specifically, I have the following input XML file structure:
<?xml version="1.0"?>
<pages>
<page id="1">
<text bbox="2831.881,1170.243,3124.184,1192.535">text11</text>
<text bbox="3149.641,1291.323,3318.336,1313.615">sheet</text>
<text bbox="3149.641,1291.323,3318.336,1313.615">P793</text>
</page>
<page id="2">
<text bbox="2831.881,1170.243,3124.184,1192.535">text21</text>
<text bbox="3149.641,1291.323,3318.336,1313.615">sheet:</text>
<text bbox="3149.641,1291.323,3318.336,1313.615">S234</text>
</page>
</pages>
and I would like to transform it into the following (note the added sheet attribute on each page element):
<?xml version="1.0"?>
<pages>
<page id="1" sheet="P793">
<text bbox="2831.881,1170.243,3124.184,1192.535">text11</text>
<text bbox="3149.641,1291.323,3318.336,1313.615">sheet</text>
<text bbox="3149.641,1291.323,3318.336,1313.615">P793</text>
</page>
<page id="2" sheet="S234">
<text bbox="2831.881,1170.243,3124.184,1192.535">text21</text>
<text bbox="3149.641,1291.323,3318.336,1313.615">sheet:</text>
<text bbox="3149.641,1291.323,3318.336,1313.615">S234</text>
</page>
</pages>
Following the example in "XSLT: Add Attribute to parent based on child attribute value containing a specific string", I tried the following XSL stylesheet:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="no" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:preserve-space elements="text"/>
<xsl:template match="/">
<xsl:apply-templates/>
</xsl:template>
<xsl:template match="page">
<xsl:apply-templates select="@*"/>
<xsl:variable name="sheet" select="//text[contains(text(),'sheet')]/following::text[string-length()>3]"/>
<xsl:attribute name="sheet"><xsl:copy-of select="$sheet" /></xsl:attribute>
<xsl:apply-templates select="node()"/>
</xsl:template>
<xsl:template match="node()|@*">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
However, I get no output. I tried replacing the variable with a for-each loop over the text nodes to define the new page attribute, but then I get an error saying that I'm trying to add an attribute after child nodes have been added, which I don't quite understand.
Is it possible to "look ahead" for such a node value and use it to add an attribute to the parent node? How? Why doesn't my stylesheet give any output?
My final goal is to also remove the text elements corresponding to the sheet values and their labels, but that seems simpler than this look-ahead attribute copy, so I'll deal with it later.
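For that later removal step, a minimal and untested sketch could layer two empty templates on top of the identity transform. It assumes that a label is any text element whose content contains "sheet" and that the value is the text element immediately after it within the same page; both patterns are illustrative assumptions, not something taken from the real pdf2txt.py output.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="no" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Identity transform: copy everything that no other template handles. -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Assumption: the label is any text element containing 'sheet'.
       An empty template emits nothing, so the element is dropped. -->
  <xsl:template match="text[contains(., 'sheet')]"/>

  <!-- Assumption: the value is the text element whose nearest preceding
       text sibling is such a label. Dropped the same way. -->
  <xsl:template match="text[preceding-sibling::text[1][contains(., 'sheet')]]"/>
</xsl:stylesheet>
```

Templates always read from the source tree, so a template on page that computes the sheet attribute would still see these elements even though they are suppressed in the output.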
Thanks!
EDIT: I simplified my input and XSL stylesheet. With the examples provided here there actually is output, but it is an error:
runtime error: file test.xsl line 18 element copy
Attribute nodes must be added before any child nodes to an element.
runtime error: file test.xsl line 13 element attribute
xsl:attribute: Cannot add attributes to an element if children have been already added to the element.
no result for -
This is an error I haven't yet figured out how to deal with. Googling didn't help.
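A minimal sketch of one possible fix, not tested against the real data. The errors appear to come from the page template not wrapping its output in xsl:copy: without it, the copied id attribute and the new sheet attribute are written to the enclosing pages element, which already has child elements by the time the second page is processed. The sketch also replaces the absolute //text path with a relative one so that each page looks only at its own children. The following-sibling::text[1] expression assumes the value always sits in the text element right after the label; that is an assumption, not part of the original stylesheet.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="no" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Identity transform: copy everything that no other template handles. -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="page">
    <!-- xsl:copy creates the output page element, so the attributes
         below are attached to it and not to the parent pages element. -->
    <xsl:copy>
      <!-- Existing attributes first (here: id). -->
      <xsl:apply-templates select="@*"/>
      <!-- New attribute, added before any child node is written.
           Relative path: only text children of the current page.
           Assumption: the value is the text element right after the label. -->
      <xsl:attribute name="sheet">
        <xsl:value-of select="text[contains(., 'sheet')]/following-sibling::text[1]"/>
      </xsl:attribute>
      <!-- Children last. -->
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

It can be run the same way as before, for example with xsltproc test.xsl input.xml, where input.xml is a placeholder name for the sample input. xsl:value-of is used in place of xsl:copy-of because an attribute value can only hold a string, not a copied element.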