After a few hours of research into XSLT, I am admitting defeat! I need to fix a large number of .xlf XLIFF translation files that have come back to us mangled by an unnamed translation tool. Ideally I would apply an XSL transform to them using a batch tool.
Below is a snippet of one of the XLIFF files:
<body>
<trans-unit id="1" phase-name="pretrans" restype="x-h3">
<source>Adding, Deleting or Modifying Notes in the Call Description</source>
<seg-source>Adding, Deleting or Modifying Notes in the Call Description</seg-source>
<target state="final">Добавление, удаление и изменение примечаний в описании звонка</target>
</trans-unit>
<trans-unit id="2" phase-name="pretrans" restype="x-p">
<source>Description of Fields on RHS</source>
<seg-source>Description of Fields on RHS</seg-source>
<target state="final">Поле описания в правой части</target>
</trans-unit>
<trans-unit id="3" phase-name="pretrans" restype="x-p">
<source>You can add descriptive text notes to a call recording, if you have the appropriate privileges to do so. These notes are visible to all users who have access to the call recording. It is recommended that each user add their initials to the notes to avoid potential confusion.</source>
<seg-source>
<mrk mtype="seg" mid="1">You can add descriptive text notes to a call recording, if you have the appropriate privileges to do so.</mrk>
<mrk mtype="seg" mid="2">These notes are visible to all users who have access to the call recording.</mrk>
<mrk mtype="seg" mid="3">It is recommended that each user add their initials to the notes to avoid potential confusion.</mrk>
</seg-source>
<target state="final">
<mrk mtype="seg" mid="1" /><ph ctype="" id="1"><MadCap:variable name="zoom_userdocs_variables.var_product_name" xmlns:MadCap="http://www.madcapsoftware.com/Schemas/MadCap.xsd" /></ph> позволяет находить телефонные взаимодействия, содержащие или не содержащие определенные фразы.
<mrk mtype="seg" mid="2" />Каждая речевая метка содержит одну или несколько таких фраз.
<mrk mtype="seg" mid="3" />Ядро <ph ctype="" id="3"><MadCap:variable name="zoom_userdocs_variables.var_product_name" xmlns:MadCap="http://www.madcapsoftware.com/Schemas/MadCap.xsd" /></ph> индексирует медиафайлы и помечает места вхождения фразы (добавляет к ним метки).
<mrk mtype="seg" mid="4" />Затем нужные медиафайлы можно искать по связанным с ними меткам.
</target>
</trans-unit>
<trans-unit id="4" phase-name="pretrans" restype="x-p">
<source>To add, delete, or modify text in the description field, click inside the description field.</source>
<seg-source>To add, delete, or modify text in the description field, click inside the description field.</seg-source>
<target state="final">Чтобы добавить, удалить или изменить текст в поле описания, щелкните это поле.</target>
</trans-unit>
</body>
Notice the <target> tag in the third <trans-unit> node. The <mrk> tags should contain the text nodes that have now become their following siblings (compare the earlier <seg-source> tag, which is still correct), which breaks the structure.
Therefore I am trying to identify any <mrk> tags that are empty and move the content that follows each one (text nodes and any inline elements such as <ph>, up to the next <mrk>) back inside it.
Here is the desired result:
<body>
<trans-unit id="1" phase-name="pretrans" restype="x-h3">
<source>Adding, Deleting or Modifying Notes in the Call Description</source>
<seg-source>Adding, Deleting or Modifying Notes in the Call Description</seg-source>
<target state="final">Добавление, удаление и изменение примечаний в описании звонка</target>
</trans-unit>
<trans-unit id="2" phase-name="pretrans" restype="x-p">
<source>Description of Fields on RHS</source>
<seg-source>Description of Fields on RHS</seg-source>
<target state="final">Поле описания в правой части</target>
</trans-unit>
<trans-unit id="3" phase-name="pretrans" restype="x-p">
<source>You can add descriptive text notes to a call recording, if you have the appropriate privileges to do so. These notes are visible to all users who have access to the call recording. It is recommended that each user add their initials to the notes to avoid potential confusion.</source>
<seg-source>
<mrk mtype="seg" mid="1">You can add descriptive text notes to a call recording, if you have the appropriate privileges to do so.</mrk>
<mrk mtype="seg" mid="2">These notes are visible to all users who have access to the call recording.</mrk>
<mrk mtype="seg" mid="3">It is recommended that each user add their initials to the notes to avoid potential confusion.</mrk>
</seg-source>
<target state="final">
<mrk mtype="seg" mid="1"><ph ctype="" id="1"><MadCap:variable name="zoom_userdocs_variables.var_product_name" xmlns:MadCap="http://www.madcapsoftware.com/Schemas/MadCap.xsd" /></ph> позволяет находить телефонные взаимодействия, содержащие или не содержащие определенные фразы.</mrk>
<mrk mtype="seg" mid="2">Каждая речевая метка содержит одну или несколько таких фраз.</mrk>
<mrk mtype="seg" mid="3">Ядро <ph ctype="" id="3"><MadCap:variable name="zoom_userdocs_variables.var_product_name" xmlns:MadCap="http://www.madcapsoftware.com/Schemas/MadCap.xsd" /></ph> индексирует медиафайлы и помечает места вхождения фразы (добавляет к ним метки).</mrk>
<mrk mtype="seg" mid="4">Затем нужные медиафайлы можно искать по связанным с ними меткам.</mrk>
</target>
</trans-unit>
<trans-unit id="4" phase-name="pretrans" restype="x-p">
<source>To add, delete, or modify text in the description field, click inside the description field.</source>
<seg-source>To add, delete, or modify text in the description field, click inside the description field.</seg-source>
<target state="final">Чтобы добавить, удалить или изменить текст в поле описания, щелкните это поле.</target>
</trans-unit>
</body>
I would normally do this in Perl with LibXML or similar, but I'm sure that this is a simple task for XSLT. I've searched for a similar solution, but couldn't find anything that I could make work.
One other point to note: although it is pretty-printed here, in the actual files the whole <body> node is on a single line.
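To make the intended behaviour concrete, here is a minimal procedural sketch of the fix in Python rather than XSLT, using only the standard library's xml.etree.ElementTree. It is an illustration of the rule described above, not the XSLT solution I am looking for. The function names (local_name, append_text, fix_target, fix_tree) and the shortened sample string are made up for this example, and it assumes every node after an <mrk> belongs to that <mrk> until the next one starts.

```python
import xml.etree.ElementTree as ET

# Keep the MadCap prefix on output instead of an auto-generated one.
ET.register_namespace("MadCap", "http://www.madcapsoftware.com/Schemas/MadCap.xsd")


def local_name(elem):
    """Return the tag name without any '{namespace}' prefix."""
    return elem.tag.rsplit("}", 1)[-1]


def append_text(mrk, text):
    """Append text at the very end of mrk's content."""
    if not text:
        return
    if len(mrk):
        # mrk already has child elements: the text goes after the last one.
        last = mrk[-1]
        last.tail = (last.tail or "") + text
    else:
        mrk.text = (mrk.text or "") + text


def fix_target(target):
    """Move every node that follows an <mrk> back inside that <mrk>."""
    current = None
    for child in list(target):          # iterate over a copy while moving nodes
        if local_name(child) == "mrk":
            current = child
            # Text right after the <mrk/> is stored as its tail; pull it inside.
            tail, child.tail = child.tail, None
            append_text(current, tail)
        elif current is not None:
            # An inline element such as <ph>: re-parent it, tail text included.
            target.remove(child)
            current.append(child)


def fix_tree(root):
    """Apply fix_target to every <target> element below (and including) root."""
    for elem in list(root.iter()):
        if local_name(elem) == "target":
            fix_target(elem)


# Hypothetical, shortened sample on a single line, like the real files.
sample = ('<target state="final"><mrk mtype="seg" mid="1" />First.'
          '<mrk mtype="seg" mid="2" />Second <ph id="1" /> part.</target>')
root = ET.fromstring(sample)
fix_tree(root)
print(ET.tostring(root, encoding="unicode"))
```

Whitespace between segments is moved along with the text, which should not matter for the real single-line files. For a whole file, the same function could be applied to ET.parse(path).getroot() before writing the tree back out. If the files declare the XLIFF 1.2 default namespace, it would probably also need registering with ET.register_namespace so that the output keeps unprefixed tag names.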
Thank you! I look forward to learning something new!
EDIT: Updated source above to show further child tags within <target> elements, which must be retained.
EDIT 2: Added desired result.