Python: Parsing XML (xliff) file including headers

Question

I am trying to parse an XML file (it's an XLIFF translation file, to be more precise), and transform it into (slightly different) TMX format.

My source XLIFF file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.0">
  <file origin="Some/Folder/proj/SomeFile.strings" source-language="en" target-language="hr" datatype="strings" product="Product BlahBlah" product-version="3.9.12" build-num="1" x-train="Blurt">
    <header>
      <count-group name="SomeFile.strings">
        <count count-type="total" unit="word">2</count>
      </count-group>
    </header>
    <body>
      <trans-unit id="8.text" restype="string" resname=""><source>End</source><target match-quality="80" match-description="_predecessor(22) _path(0) _file(15) datatype(5) id(17) restype(6) resname(4) _reserved(11) _one-word-threshold(-25)" state="signed-off" x-match-attributes="preserved-stable" state-qualifier="exact-match" x-leverage-path="predecessor-ice">Kraj</target><note>This is a note</note></trans-unit>
    </body>
  </file>
  <file origin="Some/Folder/proj/SomeOtherFile.strings" source-language="en" target-language="hr" datatype="strings" product="Product BlahBlah2" product-version="3.12.56" build-num="1" x-train="Blurt2">
    <header>
      <count-group name="SomeOtherFile.strings">
        <count count-type="total" unit="word">4</count>
      </count-group>
    </header>
    <body>
      <trans-unit id="14.accessibilityLabel" restype="string" resname=""><source>return to project list</source><target match-quality="80" match-description="_predecessor(22) _path(0) _file(15) datatype(5) id(17) restype(6) resname(4) _reserved(11)" state="signed-off" x-match-attributes="preserved-stable" state-qualifier="exact-match" x-leverage-path="predecessor-ice">povratak na popis projekata</target><note>This is again a note</note></trans-unit>
    </body>
  </file>

  (and more <file> elements continue... some with many more <trans-unit> </trans-unit> elements than these above)

  </xliff>

What I am aiming to do is to rearrange and simplify these slightly, to get the above into following format:

<tu>
    <prop type="FileSource">SomeFile.strings</prop>
    <tuv xml:lang="en">
        <seg>End</seg>
    </tuv>
    <tuv xml:lang="hr">
        <prop type="Note">This is a note</prop>
        <seg>Kraj</seg>
    </tuv>
</tu>
<tu>
    <prop type="FileSource">SomeOtherFile.strings</prop>
    <tuv xml:lang="en">
        <seg>return to project list</seg>
    </tuv>
    <tuv xml:lang="hr">
        <prop type="Note">This is again a note</prop></prop>
        <seg>povratak na popis projekata</seg>
    </tuv>
</tu>

Note that the original XLIFF file may have several <file origin ...> parts, each with many <trans-unit ...> elements (which are actual strings from that file...)

I have managed to code a part that gives me the "Source" and "Target" parts OK, but what I still need are the parts from the "file origin" elements... where the languages are defined (i.e. "source-language" and "target-language", which I will then write out as <tuv xml:lang="en"> and <tuv xml:lang="hr"> for each string), and where I can find the relevant reference to strings file (i.e. "SomeFile.strings" and "SomeOtherFile.strings", to be used as <prop type="FileSource">SomeFile.strings</prop>).

Currently I have the following Python code, which nicely extracts the required "source" and "target" elements:

#!/usr/bin/env python3
#

import sys

from lxml import etree

if len(sys.argv) < 2:
    print('Wrong number of arguments:\n => You need to provide a filename for processing!')
    exit()

file = sys.argv[1]

tree = etree.iterparse(file)
for action, elem in tree:
    if elem.tag == "source":
        print("<TransUnit>")
        print("\t<Source>" + elem.text  + "</Source>")
    elif elem.tag == "target":
        print("\t<Target>" + elem.text + "</Target>")
    elif elem.tag == "note":
        if elem.text is not None:
            print("\t<Note>" + elem.text + "</Note>")
            print("</TransUnit>")
        else: 
            print("</TransUnit>")
    else:
        next

Now, how could I also extract "source-language" (i.e. the value "en"), "target-language" (i.e. value "hr") and the file reference (i.e. "SomeFile.strings") from the "file origin ...." elements in the original XLIFF file?

Also, I need to keep (remember) that file reference, i.e.:

<prop type="FileSource">SomeOtherFile.strings</prop>

for all translation (<tu>) units that belong to that file (there can be many, unlike in the above sample, where each "file" has only one

So, for example, I would have:

<tu>
    <prop type="FileSource">SomeFile.strings</prop>
    <tuv xml:lang="en">
        <seg>End</seg>
    </tuv>
    <tuv xml:lang="hr">
        <prop type="Note">This is a note</prop>
        <seg>Kraj</seg>
    </tuv>
</tu>
<tu>
    <prop type="FileSource">SomeFile.strings</prop>
    <tuv xml:lang="en">
        <seg>Start</seg>
    </tuv>
    <tuv xml:lang="hr">
        <prop type="Note">This is a note</prop>
        <seg>Početak</seg>
    </tuv>
</tu>

where each <tu> element has a <prop type="FileSource"> element showing from which file it comes...

I would more than appreciate any help in this regard...

Denis_HR Denis_HR · Accepted Answer · 2019-05-10T22:33:12

Heh, as it often happens, I arrived at the usable solution after some more digging... Perhaps my question was unnecessarily complex, while the problem was actually identifying the proper root element(s), and proper addressing (and targeting) of children and grandchildren.

Anyway, another stackoverflow thread got me on the right path, so the solution which suits me now looks like this:

#!/usr/bin/env python3
#

import sys
import os

from lxml import etree

if len(sys.argv) < 2:
    print('Wrong number of arguments:\n => You need to provide a filename for processing!')
    exit()

file = sys.argv[1]

tree = etree.parse(file)
root = tree.getroot()

print("<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<!DOCTYPE tmx SYSTEM \"tmx14.dtd\">\n<tmx version=\"1.4\">")
print("\n<header srclang=\"en\" creationtool=\"XLIFF to TMX\" datatype=\"unknown\" adminlang=\"en\" segtype=\"sentence\" creationtoolversion=\"1.0\">")
print("</header>\n<body>")

for element in root:
    FileOrigin = (os.path.basename(element.attrib['origin']))
    Product = element.attrib['product']
    Source = element.attrib['source-language']
    Target =  element.attrib['target-language']
    # now the children
    for all_tags in element.findall('.//'):
        if all_tags.tag == "source":
            # replacing some troublesome and unnecessary codes
            srctxt = all_tags.text
            srctxt = srctxt.replace('^n', ' ')
            srctxt = srctxt.replace('^b', ' ')
            print("<tu>")
            print("\t<prop type=\"FileSource\">" + FileOrigin + "</prop>")
            print("\t<tuv xml:lang=\"" + Source + "\">")
            print("\t\t<seg>" + srctxt + "</seg>")
        elif all_tags.tag == "target":
            # replacing the same troublesome and unnecessary codes
            targtxt = all_tags.text
            targtxt = targtxt.replace('^n', ' ')
            targtxt = targtxt.replace('^b', ' ')
            print("\t<tuv xml:lang=\"" + Target + "\">")
            print("\t\t<seg>" + targtxt + "</seg>")
        elif all_tags.tag == "note":
            if all_tags.text is not None:
                print("\t\t<prop type=\"Note\">" + all_tags.text.replace('^n', ' ') + "</prop>")
                print("</tu>")
            else: 
                print("</tu>")
        else:
            next
print("</body>\n</tmx>")

Will probably tidy up a bit and add some more bells and whistles, but in general, this solves my original problem. Perhaps it might help others trying to do some xliff parsing...

Python: Parsing XML (xliff) file including headers

2 Answers