I am trying to parse an XML file (it's an XLIFF translation file, to be more precise), and transform it into (slightly different) TMX format.
My source XLIFF file looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.0">
<file origin="Some/Folder/proj/SomeFile.strings" source-language="en" target-language="hr" datatype="strings" product="Product BlahBlah" product-version="3.9.12" build-num="1" x-train="Blurt">
<header>
<count-group name="SomeFile.strings">
<count count-type="total" unit="word">2</count>
</count-group>
</header>
<body>
<trans-unit id="8.text" restype="string" resname=""><source>End</source><target match-quality="80" match-description="_predecessor(22) _path(0) _file(15) datatype(5) id(17) restype(6) resname(4) _reserved(11) _one-word-threshold(-25)" state="signed-off" x-match-attributes="preserved-stable" state-qualifier="exact-match" x-leverage-path="predecessor-ice">Kraj</target><note>This is a note</note></trans-unit>
</body>
</file>
<file origin="Some/Folder/proj/SomeOtherFile.strings" source-language="en" target-language="hr" datatype="strings" product="Product BlahBlah2" product-version="3.12.56" build-num="1" x-train="Blurt2">
<header>
<count-group name="SomeOtherFile.strings">
<count count-type="total" unit="word">4</count>
</count-group>
</header>
<body>
<trans-unit id="14.accessibilityLabel" restype="string" resname=""><source>return to project list</source><target match-quality="80" match-description="_predecessor(22) _path(0) _file(15) datatype(5) id(17) restype(6) resname(4) _reserved(11)" state="signed-off" x-match-attributes="preserved-stable" state-qualifier="exact-match" x-leverage-path="predecessor-ice">povratak na popis projekata</target><note>This is again a note</note></trans-unit>
</body>
</file>
(and more <file> elements continue... some with many more <trans-unit> </trans-unit> elements than these above)
</xliff>
What I am aiming to do is rearrange and simplify these slightly, to get the above into the following format:
<tu>
<prop type="FileSource">SomeFile.strings</prop>
<tuv xml:lang="en">
<seg>End</seg>
</tuv>
<tuv xml:lang="hr">
<prop type="Note">This is a note</prop>
<seg>Kraj</seg>
</tuv>
</tu>
<tu>
<prop type="FileSource">SomeOtherFile.strings</prop>
<tuv xml:lang="en">
<seg>return to project list</seg>
</tuv>
<tuv xml:lang="hr">
<prop type="Note">This is again a note</prop>
<seg>povratak na popis projekata</seg>
</tuv>
</tu>
Note that the original XLIFF file may have several <file origin ...> parts, each with many <trans-unit ...> elements (which are actual strings from that file...)
I have managed to code a part that gives me the "source" and "target" parts OK, but I still need the data from the <file origin ...> elements: the languages (the "source-language" and "target-language" attributes, which I will then write out as <tuv xml:lang="en"> and <tuv xml:lang="hr"> for each string) and the reference to the strings file (i.e. "SomeFile.strings" and "SomeOtherFile.strings", to be used as <prop type="FileSource">SomeFile.strings</prop>).
Currently I have the following Python code, which nicely extracts the required "source" and "target" elements:
#!/usr/bin/env python3
import sys
from lxml import etree

if len(sys.argv) < 2:
    print('Wrong number of arguments:\n => You need to provide a filename for processing!')
    sys.exit(1)

file = sys.argv[1]
# iterparse yields each element once it has been fully read ("end" events by default)
tree = etree.iterparse(file)
for action, elem in tree:
    if elem.tag == "source":
        # <source> is the first child of <trans-unit>, so open the unit here
        print("<TransUnit>")
        print("\t<Source>" + (elem.text or "") + "</Source>")
    elif elem.tag == "target":
        print("\t<Target>" + (elem.text or "") + "</Target>")
    elif elem.tag == "note":
        # <note> is the last child, so close the unit here
        if elem.text is not None:
            print("\t<Note>" + elem.text + "</Note>")
        print("</TransUnit>")
Now, how could I also extract "source-language" (i.e. the value "en"), "target-language" (i.e. value "hr") and the file reference (i.e. "SomeFile.strings") from the "file origin ...." elements in the original XLIFF file?
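One direction that might work, sketched here under assumptions rather than tested on the real file: ask the parser for "start" events as well, read the attributes of each <file> element as soon as it opens, and keep them in a variable until the next <file> starts. The sketch uses the standard-library xml.etree.ElementTree so that it runs without extra packages; lxml's etree.iterparse accepts the same events argument. The names iter_units, SAMPLE and the dictionary keys are made up for illustration, and taking the last path component of "origin" as the file reference is an assumption (the "name" attribute of <count-group> would be another candidate).

```python
import io
import posixpath
import xml.etree.ElementTree as ET

# Trimmed-down sample with two <file> elements (most attributes omitted).
SAMPLE = b"""<xliff version="1.0">
<file origin="Some/Folder/proj/SomeFile.strings" source-language="en" target-language="hr">
<body>
<trans-unit id="8.text"><source>End</source><target>Kraj</target><note>This is a note</note></trans-unit>
</body>
</file>
<file origin="Some/Folder/proj/SomeOtherFile.strings" source-language="en" target-language="hr">
<body>
<trans-unit id="14.accessibilityLabel"><source>return to project list</source><target>povratak na popis projekata</target></trans-unit>
</body>
</file>
</xliff>"""


def iter_units(stream):
    """Yield one dict per <trans-unit>, tagged with its enclosing <file> data."""
    context = {}
    for event, elem in ET.iterparse(stream, events=("start", "end")):
        if event == "start" and elem.tag == "file":
            # Attributes are already available when the element opens.
            context = {
                "file": posixpath.basename(elem.get("origin", "")),
                "source_lang": elem.get("source-language"),
                "target_lang": elem.get("target-language"),
            }
        elif event == "end" and elem.tag == "trans-unit":
            # Child text is only guaranteed to be complete on the "end" event.
            yield {
                **context,
                "source": elem.findtext("source"),
                "target": elem.findtext("target"),
                "note": elem.findtext("note"),  # None when there is no <note>
            }
            elem.clear()  # free the processed unit; helps with large files


units = list(iter_units(io.BytesIO(SAMPLE)))
for unit in units:
    print(unit)
```

Because the context is only replaced when a new <file> opens, every <trans-unit> inside the same <file> gets the same file reference and language pair, however many units there are.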
Also, I need to remember that file reference, i.e. <prop type="FileSource">SomeOtherFile.strings</prop>, for all translation units (<tu>) that belong to that file (there can be many, unlike in the sample above, where each <file> has only one).
So, for example, I would have:
<tu>
<prop type="FileSource">SomeFile.strings</prop>
<tuv xml:lang="en">
<seg>End</seg>
</tuv>
<tuv xml:lang="hr">
<prop type="Note">This is a note</prop>
<seg>Kraj</seg>
</tuv>
</tu>
<tu>
<prop type="FileSource">SomeFile.strings</prop>
<tuv xml:lang="en">
<seg>Start</seg>
</tuv>
<tuv xml:lang="hr">
<prop type="Note">This is a note</prop>
<seg>Početak</seg>
</tuv>
</tu>
- where each <tu> element has a <prop type="FileSource"> element showing which file it comes from...
I would more than appreciate any help in this regard...
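For reference, the output described above could perhaps be produced along these lines if the file is small enough to load whole: loop over the <file> elements, then over the <trans-unit> elements inside each one, so the file reference and languages are naturally in scope for every unit. This is only a sketch using the standard-library xml.etree.ElementTree (lxml.etree offers equivalents of the calls used here); build_tus, SAMPLE and XML_LANG are hypothetical names, the file reference is again assumed to be the last path component of "origin", and the sketch assumes every <file> carries both language attributes.

```python
import posixpath
import xml.etree.ElementTree as ET

# Qualified name of the xml:lang attribute; ElementTree writes it out as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# One <file> with two <trans-unit> elements (most attributes omitted).
SAMPLE = """<xliff version="1.0">
<file origin="Some/Folder/proj/SomeFile.strings" source-language="en" target-language="hr">
<body>
<trans-unit id="8.text"><source>End</source><target>Kraj</target><note>This is a note</note></trans-unit>
<trans-unit id="9.text"><source>Start</source><target>Početak</target><note>This is a note</note></trans-unit>
</body>
</file>
</xliff>"""


def build_tus(xliff_root):
    """Return a list of TMX-style <tu> elements, one per XLIFF <trans-unit>."""
    tus = []
    for file_elem in xliff_root.iter("file"):
        # Read the per-file data once; it applies to every unit below.
        file_name = posixpath.basename(file_elem.get("origin", ""))
        src_lang = file_elem.get("source-language")
        tgt_lang = file_elem.get("target-language")
        for unit in file_elem.iter("trans-unit"):
            tu = ET.Element("tu")
            ET.SubElement(tu, "prop", type="FileSource").text = file_name
            src_tuv = ET.SubElement(tu, "tuv", {XML_LANG: src_lang})
            ET.SubElement(src_tuv, "seg").text = unit.findtext("source")
            tgt_tuv = ET.SubElement(tu, "tuv", {XML_LANG: tgt_lang})
            note = unit.findtext("note")
            if note:  # skip the Note prop when <note> is missing or empty
                ET.SubElement(tgt_tuv, "prop", type="Note").text = note
            ET.SubElement(tgt_tuv, "seg").text = unit.findtext("target")
            tus.append(tu)
    return tus


tus = build_tus(ET.fromstring(SAMPLE))
for tu in tus:
    print(ET.tostring(tu, encoding="unicode"))
```

Building elements and serializing them, rather than concatenating strings, also takes care of escaping characters such as & and < in the segment text.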