I have an XML file with several translation units. "Content a" and "Content b" each have a German and a French translation, stored in separate units, so each source string appears twice in the file.

<unit>
     <src lang="en">Content a</src>
     <trg lang="de">Translation content a</trg>
</unit>

<unit>
     <src lang="en">Content a</src>
     <trg lang="fr">Translation content a</trg>
</unit>

<unit>
     <src lang="en">Content b</src>
     <trg lang="de">Translation content b</trg>
</unit>

<unit>
     <src lang="en">Content b</src>
     <trg lang="fr">Translation content b</trg>
</unit>

My aim is to avoid duplicates, so this is my desired output:

    <unit>
         <src lang="en">Content a</src>
         <trg lang="de">Translation content a</trg>
         <trg lang="fr">Translation content a</trg>
    </unit>

    <unit>
         <src lang="en">Content b</src>
         <trg lang="de">Translation content b</trg>
         <trg lang="fr">Translation content b</trg>
    </unit>


My stylesheet so far:

<xsl:template match="unit">
    <xsl:copy>
        <xsl:copy-of select="src"/>
        <xsl:for-each-group select="current-group()/(* except src)" group-by="node-name(.)">
            <xsl:for-each-group select="current-group()" group-by=".">
                <xsl:copy-of select="."/>
            </xsl:for-each-group>
        </xsl:for-each-group>
    </xsl:copy>
</xsl:template>

It produces the following output:

   <unit>
         <src lang="en">Content a</src>
         <trg lang="de">Translation content a</trg>
         <trg lang="fr">Translation content a</trg>
         <trg lang="de">Translation content b</trg>
         <trg lang="fr">Translation content b</trg>
    </unit>

Thanks for any help.

1 Answer

Given a well-formed(!) input, i.e. one with a single root element, such as:

<units>
    <unit>
         <src lang="en">Content a</src>
         <trg lang="de">Translation content a</trg>
    </unit>

    <unit>
         <src lang="en">Content a</src>
         <trg lang="fr">Translation content a</trg>
    </unit>

    <unit>
         <src lang="en">Content b</src>
         <trg lang="de">Translation content b</trg>
    </unit>

    <unit>
         <src lang="en">Content b</src>
         <trg lang="fr">Translation content b</trg>
    </unit>
</units>

you can use:

XSLT 2.0

<xsl:stylesheet version="2.0" 
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

<xsl:template match="/units">
    <xsl:copy>
        <xsl:for-each-group select="unit" group-by="src">
            <unit>
                <xsl:copy-of select="src"/>
                <xsl:copy-of select="current-group()/trg"/>
            </unit>
        </xsl:for-each-group>
    </xsl:copy>
</xsl:template>

</xsl:stylesheet>

to produce:

<?xml version="1.0" encoding="UTF-8"?>
<units>
   <unit>
      <src lang="en">Content a</src>
      <trg lang="de">Translation content a</trg>
      <trg lang="fr">Translation content a</trg>
   </unit>
   <unit>
      <src lang="en">Content b</src>
      <trg lang="de">Translation content b</trg>
      <trg lang="fr">Translation content b</trg>
   </unit>
</units>

This assumes that all src elements have lang="en". If that is not a valid assumption, use:

<xsl:for-each-group select="unit" group-by="concat(src/@lang, '|', src)">

instead.
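
For completeness, here is a sketch of the whole stylesheet with the composite key in place. It assumes every unit contains exactly one src element (otherwise concat() would raise an error in XSLT 2.0) and that the '|' separator never occurs in a lang value. Apart from the group-by expression and the comments, it matches the stylesheet above.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>

<xsl:template match="/units">
    <xsl:copy>
        <!-- Composite key: source language plus source text.
             Assumes exactly one src per unit and no '|' in @lang. -->
        <xsl:for-each-group select="unit"
                            group-by="concat(src/@lang, '|', src)">
            <unit>
                <!-- The context item is the first unit of the group,
                     so this copies that unit's src once. -->
                <xsl:copy-of select="src"/>
                <!-- Collect the trg elements of every unit in the group. -->
                <xsl:copy-of select="current-group()/trg"/>
            </unit>
        </xsl:for-each-group>
    </xsl:copy>
</xsl:template>

</xsl:stylesheet>
```

With this key, two units whose src text is identical but whose lang attributes differ end up in separate output units instead of being merged.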