0
votes

Please help me X-perts! I've got input XML documents that have a <d:Body> element enclosing "structured" text. E.g.:

<?xml version="1.0"?>
<d:Doc xmlns:d="urn:foo:bar">
<d:Body>
TITLE: An engaging topic with little to
no op-ed-ness (yes the title text wraps...)
PUBLICATION DATE: 24 March 2014
PUBLISHER: The Internet
AUTHOR: Jane Doe, Guy Smiley, Napoleon Dynamite
TEXT: Bacon ipsum dolor amet ut jerky flank, in 
aliqua kielbasa et meatball officia ea minim 
t-bone quis beef. Commodo pancetta chicken 
meatloaf consequat, eu tempor nisi et brisket 
occaecat aliquip shankle ut pork chop. Reprehenderit 
anim voluptate irure.
</d:Body>
</d:Doc>

...and I need to transform stuff like the above into something like this:

<?xml version="1.0"?>
<d:Doc xmlns:d="urn:foo:bar">
<d:Body>
<d:Pre qualifier="TITLE">TITLE: An engaging topic with little to
no op-ed-ness (yes the title text wraps...)</d:Pre>
<d:Pre qualifier="DATE">DATE: 24 March 2014</d:Pre>
<d:Pre qualifier="PUBLISHER">PUBLISHER: The Internet</d:Pre>
<d:Pre qualifier="AUTHOR">AUTHOR: Jane Doe, Guy Smiley, Napoleon Dynamite</d:Pre>
<d:Pre qualifier="TEXT">TEXT: Bacon ipsum dolor amet ut jerky flank, in 
aliqua kielbasa et meatball officia ea minim 
t-bone quis beef. Commodo pancetta chicken 
meatloaf consequat, eu tempor nisi et brisket 
occaecat aliquip shankle ut pork chop. Reprehenderit 
anim voluptate irure.</d:Pre>
</d:Body>
</d:Doc>

I'm trying to do this with an XSLT 2.0 stylesheet. The good news is that the leading tokens (TITLE, DATE, AUTHOR, etc.) are a controlled vocabulary; the bad news is that the text following those tokens may or may not wrap onto one or more subsequent lines. Of course, the XML that's produced must honor any namespaces in the original.

Any suggestions?

2 Answers

3
votes

Unfortunately the XSLT 2.0 regular expression language doesn't support zero-width lookaheads, so this is tricky to do in one step, but you could do it in two: first mark up the keywords, then extend the Pre elements to cover the following text.

<xsl:template match="d:Body">
  <xsl:copy>
    <xsl:variable name="step1" as="node()*">
      <xsl:analyze-string select="." regex="^(TITLE|DATE|PUBLISHER|AUTHOR|TEXT):"
                          flags="m">
        <xsl:matching-substring>
          <d:Pre qualifier="{regex-group(1)}"><xsl:value-of select="."/></d:Pre>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:value-of select="."/>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:variable>

    <!-- XXX -->

    <xsl:for-each-group select="$step1" group-starting-with="d:Pre">
      <xsl:if test="self::d:Pre"><!-- ignore the whitespace before the first Pre -->
        <d:Pre>
          <xsl:copy-of select="@qualifier" />
          <xsl:value-of select="current-group()" separator="" />
        </d:Pre>
      </xsl:if>
    </xsl:for-each-group>
  </xsl:copy>
</xsl:template>

At the point marked XXX the step1 variable contains an alternating sequence of text nodes and d:Pre elements that looks like this:

<d:Pre qualifier="TITLE">TITLE:</d:Pre> An engaging topic with little to
no op-ed-ness (yes the title text wraps...)
<d:Pre qualifier="DATE">DATE:</d:Pre> 24 March 2014
<d:Pre qualifier="PUBLISHER">PUBLISHER:</d:Pre> The Internet
<d:Pre qualifier="AUTHOR">AUTHOR:</d:Pre> Jane Doe, Guy Smiley, Napoleon Dynamite
<d:Pre qualifier="TEXT">TEXT:</d:Pre> Bacon ipsum dolor amet ut jerky flank, in 
aliqua kielbasa et meatball officia ea minim 
t-bone quis beef. Commodo pancetta chicken 
meatloaf consequat, eu tempor nisi et brisket 
occaecat aliquip shankle ut pork chop. Reprehenderit 
anim voluptate irure.

The for-each-group creates the final d:Pre elements covering everything up to the start of the next d:Pre:

<d:Pre qualifier="TITLE">TITLE: An engaging topic with little to
no op-ed-ness (yes the title text wraps...)
</d:Pre><d:Pre qualifier="DATE">DATE: 24 March 2014
</d:Pre><d:Pre qualifier="PUBLISHER">PUBLISHER: The Internet
</d:Pre><d:Pre qualifier="AUTHOR">AUTHOR: Jane Doe, Guy Smiley, Napoleon Dynamite
</d:Pre><d:Pre qualifier="TEXT">TEXT: Bacon ipsum dolor amet ut jerky flank, in 
aliqua kielbasa et meatball officia ea minim 
t-bone quis beef. Commodo pancetta chicken 
meatloaf consequat, eu tempor nisi et brisket 
occaecat aliquip shankle ut pork chop. Reprehenderit 
anim voluptate irure.
</d:Pre>

which is pretty much what you're after (except that the trailing newline following each section is inside its d:Pre rather than between each one and the next).
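
If the trailing newline matters, one possible variation (a sketch only, not part of the template above) is to join each group into a single string, strip its trailing whitespace, and emit the line break between the elements instead. The stylesheet wrapper and the identity template are added here just to make the example complete. The keyword list is copied from the template above and would need adjusting to the real vocabulary; the sample input, for instance, labels its date line PUBLICATION DATE, which this regex would not match.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:d="urn:foo:bar">

  <!-- Identity template: copy everything not handled below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="d:Body">
    <xsl:copy>
      <!-- Step 1, as before: wrap each keyword in a d:Pre marker. -->
      <xsl:variable name="step1" as="node()*">
        <xsl:analyze-string select="."
            regex="^(TITLE|DATE|PUBLISHER|AUTHOR|TEXT):" flags="m">
          <xsl:matching-substring>
            <d:Pre qualifier="{regex-group(1)}"><xsl:value-of select="."/></d:Pre>
          </xsl:matching-substring>
          <xsl:non-matching-substring>
            <xsl:value-of select="."/>
          </xsl:non-matching-substring>
        </xsl:analyze-string>
      </xsl:variable>

      <!-- Step 2: one d:Pre per group, with trailing whitespace removed. -->
      <xsl:for-each-group select="$step1" group-starting-with="d:Pre">
        <xsl:if test="self::d:Pre">
          <!-- Line break goes between the elements, not inside them. -->
          <xsl:text>&#10;</xsl:text>
          <d:Pre>
            <xsl:copy-of select="@qualifier"/>
            <xsl:value-of
                select="replace(string-join(current-group(), ''), '\s+$', '')"/>
          </d:Pre>
        </xsl:if>
      </xsl:for-each-group>
      <xsl:text>&#10;</xsl:text>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Without the m flag, the `$` in the replace() call anchors only at the very end of the joined string, so line breaks inside a wrapped section are left alone.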

1
votes

Assuming XSLT 3.0 (I know you said XSLT 2.0, but Ian has already given you a nice XSLT 2.0 solution) and Saxon 9.6 PE or EE, you could use:

<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xs"
  xmlns:d="urn:foo:bar">


<xsl:param name="tokens" as="xs:string" select="'TITLE,PUBLICATION DATE,PUBLISHER,AUTHOR,TEXT'"/>
<xsl:param name="regex" as="xs:string" select="concat('^(', string-join(tokenize($tokens, ','), '|'), '):')"/>

<xsl:mode on-no-match="shallow-copy"/>

<xsl:output indent="yes"/>

<xsl:template match="d:Body">
  <xsl:copy>
    <xsl:for-each-group select="tokenize(., '\n')[normalize-space()]" group-starting-with=".[matches(., $regex)]">
      <d:Pre qualifier="{replace(., ':.*', '')}">
        <xsl:value-of select="current-group()" separator="&#10;"/>
      </d:Pre>
    </xsl:for-each-group>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>
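
For anyone limited to XSLT 2.0, where group-starting-with patterns can only match nodes and not plain strings, the same line-based grouping can probably be approximated by wrapping each line in a temporary element first. The following is a rough sketch under that assumption: the `line` element name is made up for illustration, the regex is the one the 3.0 parameters would build, written out literally, and an identity template stands in for xsl:mode.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:d="urn:foo:bar">

  <!-- Keyword regex, spelled out instead of being built from a token list. -->
  <xsl:variable name="regex"
      select="'^(TITLE|PUBLICATION DATE|PUBLISHER|AUTHOR|TEXT):'"/>

  <!-- Identity template (XSLT 2.0 has no xsl:mode on-no-match). -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="d:Body">
    <xsl:copy>
      <!-- Wrap every non-blank line in a temporary (hypothetical) line element. -->
      <xsl:variable name="lines" as="element(line)*">
        <xsl:for-each select="tokenize(., '\n')[normalize-space()]">
          <line><xsl:value-of select="."/></line>
        </xsl:for-each>
      </xsl:variable>

      <!-- Start a new group at each line that begins with a keyword. -->
      <xsl:for-each-group select="$lines"
          group-starting-with="line[matches(., $regex)]">
        <d:Pre qualifier="{substring-before(., ':')}">
          <xsl:value-of select="current-group()" separator="&#10;"/>
        </d:Pre>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

As with the 3.0 version, any text before the first keyword line would end up in a group of its own, so the input is assumed to start with a keyword.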