3
votes

-- Modified question --

Thanks to everyone who has already provided potential solutions, but these are in line with what I had already tried, so I should have been clearer. I have extended the XML a bit to make the problem more transparent.

The XML is actually a compilation of various files, containing translated content, and the aim is to get a unified document containing only the unique English strings, and (after manual review and cleaning) a single translated one for each string, so it can be used for translation memory. That's why it's now a big file with loads of redundant information.

Each Para element contains the English master (which can be repeated dozens of times within the file) and the translation variants. In many cases it's easy because all translated versions are identical, so I would end up with a single line, but in other cases it can be more complex.

So, assume I have 10 Para elements containing the same English content (#1), with 2 different German variations, 3 different French variations, and only one variation for each of the remaining locales. I need to get:

1 Para containing: 1 EN / 2 DE (v1 and v2) / 3 FR (v1, v2 and v3) / ...

And this should be repeated for every unique English value in my list.

The modified XML:

<Books>
<!--First English String (#1) with number of potential translations -->
<Para>
    <EN>English Content #1</EN>
    <DE>German Trans of #1 v1</DE>
    <FR>French Trans of #1 v1</FR>
    <!-- More locales here -->
</Para>
<Para>
    <EN>English Content #1</EN>
    <DE>German Trans of #1 v2</DE>
    <FR>French Trans of #1 v1</FR>
    <!-- More locales here -->
</Para>
<Para>
    <EN>English Content #1</EN>
    <DE>German Trans of #1 v1</DE>
    <FR>French Trans of #1 v2</FR>
    <!-- More locales here -->
</Para>
<!--Second English String (#2) with number of potential translations -->
<Para>
    <EN>English Content #2</EN>
    <DE>German Trans of #2 v1</DE>
    <FR>French Trans of #2 v1</FR>
    <!-- More locales here -->
</Para>
<Para>
    <EN>English Content #2</EN>
    <DE>German Trans of #2 v3</DE>
    <FR>French Trans of #2 v1</FR>
    <!-- More locales here -->
</Para>
<Para>
    <EN>English Content #2</EN>
    <DE>German Trans of #2 v2</DE>
    <FR>French Trans of #2 v1</FR>
    <!-- More locales here -->
</Para>
<!-- Loads of additional English Strings (#3 ~ #n) with number of potential translations -->
</Books>

The current solutions give me the following output:

<Books>
<Para>
    <EN>English Content #1</EN>
    <DE>German Trans of #1 v1</DE>
    <DE>German Trans of #1 v2</DE>
    <DE>German Trans of #2 v1</DE>
    <DE>German Trans of #2 v3</DE>
    <DE>German Trans of #2 v2</DE>
    <FR>French Trans of #1 v1</FR>
    <FR>French Trans of #1 v1</FR>
    <FR>French Trans of #1 v2</FR>
    <FR>French Trans of #2 v1</FR>
</Para>
</Books>

So they take only the first EN element and then group everything else under it, irrespective of differences between the English master strings. What I am aiming for is the following:

<Books>
<!-- First Grouped EN string and linked grouped translations -->
<Para>
    <EN>English Content #1</EN>
    <DE>German Trans of #1 v1</DE>
    <DE>German Trans of #1 v2</DE>
    <FR>French Trans of #1 v1</FR>
    <FR>French Trans of #1 v2</FR>
</Para>
<!-- Second Grouped EN string and linked grouped translations -->
<Para>
    <EN>English Content #2</EN>
    <DE>German Trans of #2 v1</DE>
    <DE>German Trans of #2 v3</DE>
    <DE>German Trans of #2 v2</DE>
    <FR>French Trans of #2 v1</FR>
</Para>
<!-- 3rd to nth Grouped EN string and linked grouped translations -->
</Books>
4
Your example is confusing, especially due to the duplicate <EN></EN> values. Can you show your first stab at the XSLT as well, to show your existing logic? - Merlyn Morgan-Graham
+1 for a good question about grouping elements by their content. - Emiliano Poggi
Good question, +1. See my answer for a solution that works correctly even for close languages that have exactly the same translation :) - Dimitre Novatchev
When you ask questions about grouping in XSLT, the answer will be completely different depending on whether you are using XSLT 1.0 or XSLT 2.0, so you really need to specify what your constraints are. - Michael Kay
Thanks guys. The duplication is on purpose; unfortunately that's how my files are structured (see below for more info on why and how). @Michael: Valid point. I'm using XSLT 2.0 and tried for-each-group, but without much success due to nesting problems and lack of know-how. - Wokoman

4 Answers

2
votes

Expanded XSLT 2.0 answer, addressing the update to the question

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="Books">
        <xsl:copy>
            <xsl:for-each-group select="*" 
                group-by="EN">
                <xsl:copy>
                   <xsl:copy-of select="EN"/>
                   <xsl:for-each-group select="current-group()/*[not(local-name()='EN')]"
                        group-by=".">
                        <xsl:sort select="local-name()"/>
                        <xsl:copy-of select="."/>
                    </xsl:for-each-group>
                </xsl:copy>
            </xsl:for-each-group>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

Expanded XSLT 1.0 answer, addressing the update to the question

You can still use the same kind of solution, even though you now need two different kinds of keys. This is the first easy solution that comes to mind:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:key name="main" match="Para" use="EN"/>
    <xsl:key name="locale" match="Para/*[not(self::EN)]" use="concat(../EN,.)"/>

    <xsl:template match="Books">
        <xsl:copy>
            <xsl:apply-templates select="Para[
                generate-id()
                = generate-id(key('main',EN)[1])]" mode="EN"/>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="*" mode="EN">
        <xsl:copy>
            <xsl:copy-of select="EN"/>
            <xsl:apply-templates select="../Para/*[
                generate-id()
                = generate-id(key('locale',concat(current()/EN,.))[1])]" mode="locale">
                <xsl:sort select="local-name()"/>
            </xsl:apply-templates>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="*" mode="locale">
        <xsl:copy>
            <xsl:value-of select="."/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

When applied to the new input provided in the question, this produces:

<Books>
    <Para>
        <EN>English Content #1</EN>
        <DE>German Trans of #1 v1</DE>
        <DE>German Trans of #1 v2</DE>
        <FR>French Trans of #1 v1</FR>
        <FR>French Trans of #1 v2</FR>
    </Para>
    <Para>
        <EN>English Content #2</EN>
        <DE>German Trans of #2 v1</DE>
        <DE>German Trans of #2 v3</DE>
        <DE>German Trans of #2 v2</DE>
        <FR>French Trans of #2 v1</FR>
    </Para>
</Books>
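
As a cross-check that does not depend on an XSLT processor, the same two-level grouping can be sketched in Python with the standard-library xml.etree.ElementTree module. This is only an illustration: the function name group_paras and the trimmed SAMPLE input are assumptions made for the example, not part of the stylesheets above. Unlike the 'locale' key, which concatenates the English string and the translation text, this sketch detects duplicates on the (element name, text) pair within each English string.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened version of the input from the question.
SAMPLE = """<Books>
<Para><EN>English Content #1</EN><DE>German Trans of #1 v1</DE><FR>French Trans of #1 v1</FR></Para>
<Para><EN>English Content #1</EN><DE>German Trans of #1 v2</DE><FR>French Trans of #1 v1</FR></Para>
<Para><EN>English Content #2</EN><DE>German Trans of #2 v1</DE><FR>French Trans of #2 v1</FR></Para>
</Books>"""


def group_paras(xml_text):
    """Return a new tree with one Para per distinct EN text and each
    distinct (locale, translation) pair listed once under it."""
    root = ET.fromstring(xml_text)
    groups = {}  # EN text -> list of (tag, text); dicts keep insertion order
    for para in root.findall("Para"):
        pairs = groups.setdefault(para.findtext("EN"), [])
        for child in para:
            # Skip non-element nodes (if any) and the EN master itself.
            if not isinstance(child.tag, str) or child.tag == "EN":
                continue
            pair = (child.tag, child.text)
            if pair not in pairs:
                pairs.append(pair)

    out = ET.Element(root.tag)
    for en_text, pairs in groups.items():
        para = ET.SubElement(out, "Para")
        ET.SubElement(para, "EN").text = en_text
        # Stable sort by locale name, similar in spirit to xsl:sort on local-name().
        for tag, text in sorted(pairs, key=lambda p: p[0]):
            ET.SubElement(para, tag).text = text
    return out


if __name__ == "__main__":
    print(ET.tostring(group_paras(SAMPLE), encoding="unicode"))
```

The outer dictionary plays the role of the 'main' key, and the per-group list plays the role of the 'locale' key.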

This XSLT 1.0 transform does exactly what you are asking for, and it can be used as a starting point for a more meaningful result tree if you like:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes" indent="yes"/>
    <xsl:strip-space elements="*"/>


    <xsl:key name="locale" match="Para/*[not(local-name()='EN')]" use="text()"/>

    <xsl:template match="Books">
        <xsl:copy>
            <Para>
                <xsl:copy-of select="Para[1]/EN"/>
                <xsl:apply-templates select="Para/*[
                    generate-id()
                    = generate-id(key('locale',text())[1])]" mode="group">
                    <xsl:sort select="local-name()"/>
                </xsl:apply-templates>
            </Para>
        </xsl:copy>
    </xsl:template>

    <xsl:template match="*" mode="group">
        <xsl:copy>
            <xsl:value-of select="."/>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>

Explanation:

  • xsl:key is used to group all elements (except EN) by their content
  • Simple direct copy of the first Para/EN node
  • Muenchian method of grouping, with xsl:sort, to output the other elements grouped as requested (elements with the same content are reported once)

When applied to the input provided in the question, the result tree is:

<Books>
   <Para>
      <EN>Some English Content</EN>
      <DE>German Trans v1</DE>
      <DE>German Trans v2</DE>
      <FR>French Trans v1</FR>
      <FR>French Trans v2</FR>
   </Para>
</Books>
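
For readers unfamiliar with the Muenchian method named in the explanation, its core test (a node is kept only if it is the first node stored under its key) can be mimicked in a few lines of Python. This is a rough sketch, not a translation of the stylesheet: muenchian_first is a name invented here, and DOC is a hypothetical input reconstructed to be consistent with the result tree above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical input, chosen to be consistent with the result tree above.
DOC = """<Books>
  <Para><EN>Some English Content</EN><DE>German Trans v1</DE><FR>French Trans v1</FR></Para>
  <Para><EN>Some English Content</EN><DE>German Trans v2</DE><FR>French Trans v1</FR></Para>
  <Para><EN>Some English Content</EN><DE>German Trans v1</DE><FR>French Trans v2</FR></Para>
</Books>"""


def muenchian_first(nodes, key_of):
    """Keep each node that is the first one indexed under its key."""
    index = defaultdict(list)  # stands in for the xsl:key index
    for node in nodes:
        index[key_of(node)].append(node)
    # The identity test mirrors generate-id() = generate-id(key(...)[1]).
    return [n for n in nodes if index[key_of(n)][0] is n]


root = ET.fromstring(DOC)
translations = [el for el in root.iterfind("Para/*") if el.tag != "EN"]
unique = muenchian_first(translations, lambda el: el.text)
unique.sort(key=lambda el: el.tag)  # comparable to xsl:sort on local-name()
result = [(el.tag, el.text) for el in unique]
print(result)
```

Here the key is the element text only, as in the 'locale' key of the stylesheet, so two different elements with identical text would be treated as duplicates.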

Same result (and shorter transform) with XSLT 2.0 xsl:for-each-group:

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:output omit-xml-declaration="yes" indent="yes"/>
    <xsl:strip-space elements="*"/>

    <xsl:template match="Books">
        <xsl:copy>
            <Para>
                <xsl:copy-of select="Para[1]/EN"/>
                <xsl:for-each-group select="Para/*[not(local-name()='EN')]" 
                            group-by=".">
                    <xsl:sort select="local-name()"/>
                    <xsl:copy>
                        <xsl:value-of select="."/>
                    </xsl:copy>
                </xsl:for-each-group>
            </Para>
        </xsl:copy>
    </xsl:template>

</xsl:stylesheet>
1
votes

This transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:key name="kLangByValAndText"
  match="Para/*[not(self::EN)]"
  use="concat(name(), '+++', .)"/>

 <xsl:template match="/">
  <Books>
   <Para>
    <xsl:copy-of select="/*/Para[1]/EN"/>
    <xsl:for-each select=
    "/*/*/*[generate-id()
           =
            generate-id(key('kLangByValAndText',
                            concat(name(), '+++', .)
                            )
                            [1]
                       )
           ]
    ">
     <xsl:sort select="name()"/>
     <xsl:copy-of select="."/>
    </xsl:for-each>
   </Para>
  </Books>
 </xsl:template>
</xsl:stylesheet>

when applied on this XML document (an extended version of the provided one to make it more interesting):

<Books>
    <Para>
        <EN>Some English Content</EN>
        <DE>German Trans v1</DE>
        <FR>French Trans v1</FR>
        <!-- More locales here -->
    </Para>
    <Para>
        <EN>Some English Content</EN>
        <EN-US>Some English Content</EN-US>
        <DE>German Trans v1</DE>
        <FR>French Trans v1</FR>
        <!-- More locales here -->
    </Para>
    <Para>
        <EN>Some English Content</EN>
        <Australian>Some English Content</Australian>
        <DE>German Trans v1</DE>
        <FR>French Trans v2</FR>
        <!-- More locales here -->
    </Para>
    <!-- Much more para's hereafter containing variety of <EN> Content -->
</Books>

Produces the wanted, correct result:

<Books>
   <Para>
      <EN>Some English Content</EN>
      <Australian>Some English Content</Australian>
      <DE>German Trans v1</DE>
      <EN-US>Some English Content</EN-US>
      <FR>French Trans v1</FR>
      <FR>French Trans v2</FR>
   </Para>
</Books>

Explanation: Muenchian grouping on a composite (2-part) key.

Do note: Grouping only on the translation (as done in another answer to this question) loses the <Australian> translation -- apply the solution by @empo to this same document, and the result is (<Australian> is lost!):

<Books>
   <Para>
      <EN>Some English Content</EN>
      <DE>German Trans v1</DE>
      <EN-US>Some English Content</EN-US>
      <FR>French Trans v1</FR>
      <FR>French Trans v2</FR>
   </Para>
</Books>
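
The difference between the two keys can also be reproduced outside XSLT. The following Python sketch is an illustration only: distinct_tags is a name invented here, and DOC is the extended document from this answer with comments and whitespace trimmed. It de-duplicates the non-EN elements once by text alone and once by element name plus text.

```python
import xml.etree.ElementTree as ET

# The extended document from this answer, with comments removed.
DOC = """<Books>
  <Para><EN>Some English Content</EN><DE>German Trans v1</DE><FR>French Trans v1</FR></Para>
  <Para><EN>Some English Content</EN><EN-US>Some English Content</EN-US><DE>German Trans v1</DE><FR>French Trans v1</FR></Para>
  <Para><EN>Some English Content</EN><Australian>Some English Content</Australian><DE>German Trans v1</DE><FR>French Trans v2</FR></Para>
</Books>"""


def distinct_tags(key_of):
    """Return the sorted element names that survive de-duplication under key_of."""
    seen, kept = set(), []
    for el in ET.fromstring(DOC).iterfind("Para/*"):
        if el.tag == "EN":
            continue
        key = key_of(el)
        if key not in seen:  # first occurrence of this key wins
            seen.add(key)
            kept.append(el.tag)
    return sorted(kept)


# Key on the text only: Australian collides with EN-US and is dropped.
by_text = distinct_tags(lambda el: el.text)
# Composite key, like concat(name(), '+++', .): both elements are kept.
by_name_and_text = distinct_tags(lambda el: el.tag + "+++" + el.text)

print(by_text)           # ['DE', 'EN-US', 'FR', 'FR']
print(by_name_and_text)  # ['Australian', 'DE', 'EN-US', 'FR', 'FR']
```

With the text-only key, the Australian element has the same key as the earlier EN-US element and disappears. With the composite key it survives, which matches the two result trees shown in this answer.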
0
votes

Another Muenchian grouping, with compound keys for the sub-level:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes" />
  <xsl:key name="english" match="EN" use="." />
  <xsl:key name="others" match="Para/*[not(self::EN)]" use="concat(../EN, '&#160;', ., '&#160;', name())" />
  <xsl:template match="/Books">
    <Books>
      <xsl:for-each select="Para/EN[generate-id() = generate-id(key('english', .)[1])]">
        <Para>
          <xsl:copy-of select=".|key('english', .)/../*[not(self::EN)][generate-id() = generate-id(key('others', concat(current(), '&#160;', ., '&#160;', name()))[1])]" />
        </Para>
      </xsl:for-each>
    </Books>
  </xsl:template>
</xsl:stylesheet>
0
votes

With Saxon 9, when I apply the stylesheet

<xsl:stylesheet
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="2.0">

  <xsl:strip-space elements="*"/>
  <xsl:output indent="yes"/>

  <xsl:template match="Books">
    <xsl:copy>
      <xsl:for-each-group select="Para" group-by="EN">
        <xsl:apply-templates select="."/>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="Para">
    <xsl:copy>
      <xsl:copy-of select="EN"/>
      <xsl:for-each-group select="current-group()/(* except EN)" group-by="node-name(.)">
        <xsl:for-each-group select="current-group()" group-by=".">
          <xsl:copy-of select="."/>
        </xsl:for-each-group>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>

to the input

<Books>
<!--First English String (#1) with number of potential translations -->
<Para>
    <EN>English Content #1</EN>
    <DE>German Trans of #1 v1</DE>
    <FR>French Trans of #1 v1</FR>
    <!-- More locales here -->
</Para>
<Para>
    <EN>English Content #1</EN>
    <DE>German Trans of #1 v2</DE>
    <FR>French Trans of #1 v1</FR>
    <!-- More locales here -->
</Para>
<Para>
    <EN>English Content #1</EN>
    <DE>German Trans of #1 v1</DE>
    <FR>French Trans of #1 v2</FR>
    <!-- More locales here -->
</Para>
<!--Second English String (#2) with number of potential translations -->
<Para>
    <EN>English Content #2</EN>
    <DE>German Trans of #2 v1</DE>
    <FR>French Trans of #2 v1</FR>
    <!-- More locales here -->
</Para>
<Para>
    <EN>English Content #2</EN>
    <DE>German Trans of #2 v3</DE>
    <FR>French Trans of #2 v1</FR>
    <!-- More locales here -->
</Para>
<Para>
    <EN>English Content #2</EN>
    <DE>German Trans of #2 v2</DE>
    <FR>French Trans of #2 v1</FR>
    <!-- More locales here -->
</Para>
</Books>

I get the result

<Books>
   <Para>
      <EN>English Content #1</EN>
      <DE>German Trans of #1 v1</DE>
      <DE>German Trans of #1 v2</DE>
      <FR>French Trans of #1 v1</FR>
      <FR>French Trans of #1 v2</FR>
   </Para>
   <Para>
      <EN>English Content #2</EN>
      <DE>German Trans of #2 v1</DE>
      <DE>German Trans of #2 v3</DE>
      <DE>German Trans of #2 v2</DE>
      <FR>French Trans of #2 v1</FR>
   </Para>
</Books>