I'm new to XSLT. I need to aggregate some information from the contents of PDF files, which I convert to XML with pdf2txt.py. Some of the PDFs are large (100+ MB), and their XML output is even larger. Hence, it seems more time-efficient to process everything in memory, piping the output through several xsltproc commands to prune unneeded content from the XML. Among other things, there is an XML element whose text content I would like to turn into an attribute of its parent element.

More specifically, I have the following input XML file structure:

<?xml version="1.0"?>
<pages>
  <page id="1">
    <text bbox="2831.881,1170.243,3124.184,1192.535">text11</text>
    <text bbox="3149.641,1291.323,3318.336,1313.615">sheet</text>
    <text bbox="3149.641,1291.323,3318.336,1313.615">P793</text>
  </page>
  <page id="2">
    <text bbox="2831.881,1170.243,3124.184,1192.535">text21</text>
    <text bbox="3149.641,1291.323,3318.336,1313.615">sheet:</text>
    <text bbox="3149.641,1291.323,3318.336,1313.615">S234</text>
  </page>
</pages>

and I would like to transform it into (notice the added page attribute):

<?xml version="1.0"?>
<pages>
  <page id="1" sheet="P793">
    <text bbox="2831.881,1170.243,3124.184,1192.535">text11</text>
    <text bbox="3149.641,1291.323,3318.336,1313.615">sheet</text>
    <text bbox="3149.641,1291.323,3318.336,1313.615">P793</text>
  </page>
  <page id="2" sheet="S234">
    <text bbox="2831.881,1170.243,3124.184,1192.535">text21</text>
    <text bbox="3149.641,1291.323,3318.336,1313.615">sheet:</text>
    <text bbox="3149.641,1291.323,3318.336,1313.615">S234</text>
  </page>
</pages>

Following the example in "XSLT: Add Attribute to parent based on child attribute value containing a specific string", I tried the following XSL stylesheet:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="no" indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:preserve-space elements="text"/>

<xsl:template match="/">
 <xsl:apply-templates/>
</xsl:template>

<xsl:template match="page">
   <xsl:apply-templates select="@*"/>
  <xsl:variable name="sheet" select="//text[contains(text(),'sheet')]/following::text[string-length()>3]"/>
  <xsl:attribute name="sheet"><xsl:copy-of select="$sheet" /></xsl:attribute>
   <xsl:apply-templates select="node()"/>
</xsl:template>

<xsl:template match="node()|@*">
  <xsl:copy>
   <xsl:apply-templates select="@*|node()"/>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>

However, I get no output. I tried replacing the variable with a for-each loop over the text elements to define the new page attribute, but then I get an error saying that I'm trying to add an attribute after adding child nodes, which I don't quite understand.

Is it possible to "look ahead" for such a node value and use it to add an attribute to the parent node? How? Why doesn't my stylesheet give any output?

My final goal is also to remove the text elements corresponding to the sheet values and their labels, but that seems simpler than this look-ahead and attribute copy, so I'll deal with it later.

Thanks!

EDIT: I simplified my input and XSL stylesheet. With the examples provided here there actually is output, but it is an error:

runtime error: file test.xsl line 18 element copy
Attribute nodes must be added before any child nodes to an element.
runtime error: file test.xsl line 13 element attribute
xsl:attribute: Cannot add attributes to an element if children have been already added to the element.
no result for -

This is an error I haven't figured out how to deal with yet. Googling didn't help.

1 Answer

The main problem is in the template matching page, where the first thing you do is output attributes:

<xsl:template match="page">
    <xsl:apply-templates select="@*"/>

But you have not actually copied the page element first, so the processor tries to add the attributes, and the child text elements, to the last element that was created, namely pages. For the second page element it tries to do the same thing, but fails because you cannot add attributes to an element that already has child nodes.

Try this template instead:

<xsl:template match="page">
    <xsl:copy>
        <xsl:apply-templates select="@*"/>
        <xsl:variable name="sheet" select="text[contains(text(),'sheet')]/following-sibling::text[string-length()>3]"/>
        <xsl:attribute name="sheet"><xsl:value-of select="$sheet"/></xsl:attribute>
        <xsl:apply-templates select="node()"/>
    </xsl:copy>
</xsl:template>

Note the change in the expression for sheet. Previously it started with //text, which selects matching text elements anywhere in the document, regardless of which page is being processed. The // needs to be removed to make the expression relative to the current page element.

Additionally, note the use of following-sibling rather than following, so that the expression restricts itself to the sibling elements under the current page element.

Finally, is it only the immediately following sibling you want to access? If so, you might need to add an extra condition to the expression:
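
Putting the pieces together, a complete stylesheet might look like the sketch below. It assumes the element names from the question (pages, page, text) and keeps the question's output and whitespace settings. The template matching / is left out, because the built-in rule for the root node already applies templates to its children.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="no" indent="yes"/>
  <xsl:strip-space elements="*"/>
  <xsl:preserve-space elements="text"/>

  <!-- Identity template: copies every node and attribute unchanged. -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Override for page: create the element first, then its existing
       attributes, then the new sheet attribute, and only then the children. -->
  <xsl:template match="page">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <!-- Relative path: only text elements inside the current page. -->
      <xsl:variable name="sheet"
          select="text[contains(text(),'sheet')]/following-sibling::text[string-length()>3]"/>
      <!-- In XSLT 1.0, value-of outputs the string value of the first selected node. -->
      <xsl:attribute name="sheet"><xsl:value-of select="$sheet"/></xsl:attribute>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

With the sample input from the question, this should give sheet="P793" on the first page and sheet="S234" on the second. The order inside xsl:copy is what matters: all attributes have to be written before the first child node.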

<xsl:variable name="sheet" select="text[contains(text(),'sheet')]/following-sibling::text[1][string-length()>3]"/>

Or perhaps reverse the logic, and write it this way instead:

<xsl:variable name="sheet" select="text[string-length()>3][contains(preceding-sibling::text[1],'sheet')]"/>
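
As for the final goal mentioned in the question, removing the label and value elements once the attribute has been set, one possible approach is to add empty templates that match those elements, so the identity template never copies them. The following is only a sketch under assumptions: the value is taken to be the text element immediately after the label, and a page without a label is assumed to get no sheet attribute at all.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output omit-xml-declaration="no" indent="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Identity template: copies every node and attribute unchanged. -->
  <xsl:template match="node()|@*">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="page">
    <xsl:copy>
      <xsl:apply-templates select="@*"/>
      <xsl:variable name="sheet"
          select="text[contains(text(),'sheet')]/following-sibling::text[1][string-length()>3]"/>
      <!-- Assumption: skip the attribute when no sheet value was found. -->
      <xsl:if test="$sheet">
        <xsl:attribute name="sheet"><xsl:value-of select="$sheet"/></xsl:attribute>
      </xsl:if>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Empty template: drops the label element (text containing 'sheet'). -->
  <xsl:template match="page/text[contains(text(),'sheet')]"/>

  <!-- Empty template: drops the element right after the label (the sheet value). -->
  <xsl:template match="page/text[contains(preceding-sibling::text[1],'sheet')]"/>

</xsl:stylesheet>
```

The variable is evaluated against the source tree, so suppressing the two elements in the output does not affect the value of the sheet attribute. If other text elements can legitimately contain the word sheet, the match conditions would need to be tightened.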