I have a very flat document which contains implied groups of elements based on their positioning after a Heading item:

<Document>
    <Body>
        ...
        <Heading>Section 1</Heading>
        <Item Id="1.1">Alpha</Item>
        <Item Id="1.1">Bravo</Item>
        ...
        <Heading>Section 2</Heading>
        <Item Id="2.1">Alpha</Item>
        <Item Id="2.1">Bravo</Item>
        ...
    </Body>
</Document>

From this document, I want to extract the groups, and also filter the items in each group so that only the first item with a given identifier is kept. For example, where there are two items with the ID "1.1", only the first item is expected in the output. I intend to do additional processing to include the duplicates as children of the first item.

To achieve this grouping, I am using Muenchian grouping, where the key for the group is the identifier value:

<xsl:key
    name="ItemsById"
    match="/Document/Body/Item"
    use="@Id"/>

This works great, except that there are a number of Item elements defined as examples that happen to use the same identifiers and wind up in the node-set matched by the key.

As there is a range in the middle of the document that I care about, I am using the Kayessian method of intersection to restrict the node-set to just the section in the document I am interested in:

<xsl:variable
    name="section"
    select="(/Document/Body/Heading[text() = 'Example']
        /following-sibling::*[2]/following-sibling::*)[
    count(. | /Document/Body/Heading[text() = 'Appendix B']
        /preceding-sibling::*) 
    = count(/Document/Body/Heading[text() = 'Appendix B']
        /preceding-sibling::*)
    ]" />

This node-set is the intersection of two node-sets: all the elements after the Heading "Section 1" (including the heading itself) and all the elements before the Heading "Appendix B".

This matches the elements I care about. However, since the key is unfiltered, the "first" value for a given identifier is sometimes outside of this node-set. I have tried using the variable in the key, but I've since discovered that there are numerous restrictions on the match in a key which prevent the use of variables.

Here is the full source document:

<Document>
    <Body>

        <Heading>Preamble</Heading>
        <Para>
            Lorem ipsum dolor sit amet, consectetur
            adipiscing elit, sed do eiusmod tempor incididunt
            ut labore et dolore magna aliqua.
        </Para>

        <Heading>Example</Heading>
        <Item Id="1.1">Example Alpha</Item>
        <Item Id="1.1">Example Bravo</Item>

        <Heading>Section 1</Heading>
        <Item Id="1.1">Alpha</Item>
        <Item Id="1.1">Bravo</Item>
        <Item Id="1.2">Charlie</Item>
        <Item Id="1.3">Delta</Item>
        <Item Id="1.3">Echo</Item>
        <Item Id="1.4">Foxtrot</Item>

        <Heading>Section 2</Heading>
        <Item Id="2.1">Alpha</Item>
        <Item Id="2.1">Bravo</Item>
        <Item Id="2.2">Charlie</Item>
        <Item Id="2.3">Delta</Item>
        <Item Id="2.3">Echo</Item>
        <Item Id="2.4">Foxtrot</Item>

        <Heading>Appendix A</Heading>
        <Item Id="A.1">Alpha</Item>
        <Item Id="A.1">Bravo</Item>
        <Item Id="A.2">Charlie</Item>
        <Item Id="A.3">Delta</Item>
        <Item Id="A.3">Echo</Item>
        <Item Id="A.4">Foxtrot</Item>

        <Heading>Appendix B</Heading>
        <Para>
            Lorem ipsum dolor sit amet, consectetur
            adipiscing elit, sed do eiusmod tempor incididunt
            ut labore et dolore magna aliqua.
        </Para>

    </Body>
</Document>

I'm applying the following stylesheet:

<xsl:stylesheet
    version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:output method="xml" indent="yes"/>

    <xsl:template match="@* | node()">
        <xsl:copy>
            <xsl:apply-templates select="@* | node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- The node-set which covers the wanted section of elements. -->
    <xsl:variable
        name="section"
        select="(/Document/Body/Heading[text() = 'Example']
            /following-sibling::*[2]/following-sibling::*)[
        count(. | /Document/Body/Heading[text() = 'Appendix B']
            /preceding-sibling::*) 
        = count(/Document/Body/Heading[text() = 'Appendix B']
            /preceding-sibling::*)
        ]" />

    <!-- The items keyed by their ID. -->
    <xsl:key
        name="ItemsById"
        match="/Document/Body/Item"
        use="@Id"/>

    <!-- Matches the root to begin the output structure. -->
    <xsl:template match="/">
        <Document>
            <!-- Apply templates to the headings. -->
            <xsl:apply-templates select="$section[local-name() = 'Heading']" />
        </Document>
    </xsl:template>

    <xsl:template match="/Document/Body/Heading">
        <Section>
            <xsl:attribute name="Title">
                <xsl:value-of select="."/>
            </xsl:attribute>

            <xsl:variable
                name="heading"
                select="generate-id()" />

            <!-- Apply templates to the items in this set. -->
            <xsl:apply-templates
                select="$section[
                local-name() = 'Item'
                and
                generate-id() = generate-id(key('ItemsById', @Id)[1])
                and
                $heading = generate-id(preceding-sibling::Heading[1])
                ]" />
        </Section>
    </xsl:template>

</xsl:stylesheet>

This is the current output:

<Document>
  <Section Title="Section 1">
    <Item Id="1.2">Charlie</Item>
    <Item Id="1.3">Delta</Item>
    <Item Id="1.4">Foxtrot</Item>
  </Section>
  <Section Title="Section 2">
    <Item Id="2.1">Alpha</Item>
    <Item Id="2.2">Charlie</Item>
    <Item Id="2.3">Delta</Item>
    <Item Id="2.4">Foxtrot</Item>
  </Section>
  <Section Title="Appendix A">
    <Item Id="A.1">Alpha</Item>
    <Item Id="A.2">Charlie</Item>
    <Item Id="A.3">Delta</Item>
    <Item Id="A.4">Foxtrot</Item>
  </Section>
</Document>

The issue is that Item 1.1 is missing from Section 1.

Is there anything different I can try to achieve the same grouping over the section I'm interested in?

Good question, well asked. Sure that you're limited to XSLT 1.0? – Mathias Müller
I'm running under .NET, so sadly yes. – Paul Turner
How big are the documents? And how concerned are you about performance? There are ways without using keys, but they'd be much slower. (Btw, it's a shame .NET only supports 1.0...). – Mathias Müller
The documents are 8Mb in size, but performance isn't a priority. Fast is nice, but not necessary. – Paul Turner
Sorry, that's still not clear. Do you mean any section between Example and Appendix B, excluding the two? – michael.hor257k

1 Answer

Couldn't this be (much) simpler? For example, the following stylesheet:

XSLT 1.0

<xsl:stylesheet version="1.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

<xsl:key name="item-by-heading" match="Item" use="generate-id(preceding-sibling::Heading[1])" />
<xsl:key name="item-by-id" match="Item" use="concat(generate-id(preceding-sibling::Heading[1]), '|', @Id)" />

<xsl:template match="/Document">
    <xsl:copy>
        <xsl:apply-templates select="Body/Heading"/>
    </xsl:copy>
</xsl:template>

<xsl:template match="Heading">
    <Section Title="{.}">
        <xsl:copy-of select="key('item-by-heading', generate-id())[count(. | key('item-by-id', concat(generate-id(preceding-sibling::Heading[1]), '|', @Id))[1]) = 1]"/>
    </Section>
</xsl:template> 

</xsl:stylesheet>

when applied to your input, will return:

<?xml version="1.0" encoding="UTF-8"?>
<Document>
   <Section Title="Preamble"/>
   <Section Title="Example">
      <Item Id="1.1">Example Alpha</Item>
   </Section>
   <Section Title="Section 1">
      <Item Id="1.1">Alpha</Item>
      <Item Id="1.2">Charlie</Item>
      <Item Id="1.3">Delta</Item>
      <Item Id="1.4">Foxtrot</Item>
   </Section>
   <Section Title="Section 2">
      <Item Id="2.1">Alpha</Item>
      <Item Id="2.2">Charlie</Item>
      <Item Id="2.3">Delta</Item>
      <Item Id="2.4">Foxtrot</Item>
   </Section>
   <Section Title="Appendix A">
      <Item Id="A.1">Alpha</Item>
      <Item Id="A.2">Charlie</Item>
      <Item Id="A.3">Delta</Item>
      <Item Id="A.4">Foxtrot</Item>
   </Section>
   <Section Title="Appendix B"/>
</Document>

I couldn't understand how you determine which sections you want to include in (or exclude from) the output, but that too should be easy.
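
In case the one-line predicate above is hard to read, here is a more verbose sketch of the same Heading template. It assumes the two keys (item-by-heading and item-by-id) are declared exactly as in the stylesheet above. The variable name heading-id and the generate-id() comparison (used in place of the count(. | ...) = 1 test) are only my choices for illustration:

```xml
<xsl:template match="Heading">
    <!-- Id of the current heading. Every Item is keyed by the
         generate-id() of its nearest preceding Heading. -->
    <xsl:variable name="heading-id" select="generate-id()"/>
    <Section Title="{.}">
        <!-- Take all Items under this heading, then keep an Item only if
             it is the first node the composite key (heading id, '|', @Id)
             returns, i.e. the first Item with that Id in this section. -->
        <xsl:copy-of select="key('item-by-heading', $heading-id)
            [generate-id() = generate-id(key('item-by-id', concat($heading-id, '|', @Id))[1])]"/>
    </Section>
</xsl:template>
```

Because the heading's id is part of the key value, an Item in one section can never be the "first" for an Item in another section, even when both have the same Id.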


Edit:

The sections I want is Sections 1-2 and Appendix A; no other sections are relevant.

Well, then just do:

<xsl:template match="/Document">
    <xsl:copy>
        <xsl:apply-templates select="Body/Heading[.='Section 1' or .='Section 2' or .='Appendix A']"/>
    </xsl:copy>
</xsl:template>

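Note that if the item ids were not duplicated across sections, then this could be even simpler. Ah, but I see that they are: the Example section also uses the id 1.1. That is the reason why item 1.1 is missing: your key covers the whole document, so the first node it returns for 1.1 is the "Example Alpha" item, which lies outside the range you select.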
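
If you want to check this for yourself, here is a small diagnostic sketch (not part of the solution). It reuses the ItemsById key exactly as declared in the question and prints the first node that key returns for the id 1.1. The text output method is only my choice to keep the result short:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>

<!-- Same document-wide key as in the question. -->
<xsl:key name="ItemsById" match="/Document/Body/Item" use="@Id"/>

<xsl:template match="/">
    <!-- key() returns a node-set; [1] picks the first node in document
         order, regardless of which section it belongs to. -->
    <xsl:value-of select="key('ItemsById', '1.1')[1]"/>
</xsl:template>

</xsl:stylesheet>
```

Applied to your input, this prints "Example Alpha", so neither of the two 1.1 items in Section 1 can pass the generate-id() test in your stylesheet.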


Edit 2:

This node-set is the intersection of two node-sets: all the elements after the Heading "Section 1" (including the heading itself) and all the elements before the Heading "Appendix B".

Okay, so:

<xsl:template match="/Document">
    <xsl:copy>
        <xsl:apply-templates select="Body/Heading[.='Section 1' or preceding-sibling::Heading[.='Section 1'] and following-sibling::Heading[.='Appendix B']]"/>
    </xsl:copy>
</xsl:template>

Or, even shorter:

<xsl:template match="/Document">
    <xsl:copy>
        <xsl:apply-templates select="Body/Heading[not(following-sibling::Heading[.='Section 1']) and following-sibling::Heading[.='Appendix B']]"/>
    </xsl:copy>
</xsl:template>
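
For convenience, here is how the pieces could be put together into one stylesheet. This is just a sketch that combines the two keys and the Heading template from the first stylesheet with the last /Document template; nothing in it is new:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

<!-- Items grouped by the nearest preceding heading. -->
<xsl:key name="item-by-heading" match="Item"
    use="generate-id(preceding-sibling::Heading[1])"/>
<!-- Items grouped by heading and Id together. -->
<xsl:key name="item-by-id" match="Item"
    use="concat(generate-id(preceding-sibling::Heading[1]), '|', @Id)"/>

<xsl:template match="/Document">
    <xsl:copy>
        <!-- Headings from 'Section 1' up to, but not including, 'Appendix B'. -->
        <xsl:apply-templates select="Body/Heading[not(following-sibling::Heading[.='Section 1']) and following-sibling::Heading[.='Appendix B']]"/>
    </xsl:copy>
</xsl:template>

<xsl:template match="Heading">
    <Section Title="{.}">
        <!-- First Item per Id within this heading's group. -->
        <xsl:copy-of select="key('item-by-heading', generate-id())[count(. | key('item-by-id', concat(generate-id(preceding-sibling::Heading[1]), '|', @Id))[1]) = 1]"/>
    </Section>
</xsl:template>

</xsl:stylesheet>
```

With your sample input this should produce only the sections Section 1, Section 2 and Appendix A, with item 1.1 ("Alpha") included in Section 1.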