My requirements are:
- Receive XML document from client
- Translate certain XML elements and attributes (according to predefined rules)
- Write out translated XML document
- Return XML document to client
The XML document MUST NOT be modified in any way other than the desired translations. This is a client requirement: their XML files are edited by hand, and the person editing them expects the XML formatting to look a certain way.
Is there an XML parser that can apply these translations while leaving the rest of the document untouched? Here is a simple example that uses the StAX parser but does not preserve some parts of the input XML:
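To make "not modified in any way other than the desired translations" concrete, here is a minimal sketch of the behaviour I am after, assuming the character offsets of each translatable span are already known. The names `SpliceSketch`, `Edit` and `splice` are hypothetical and not taken from any library; finding the offsets is the part I do not yet have a solution for.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class SpliceSketch {

    /** Hypothetical value type: replace source[start, end) with text. */
    static final class Edit {
        final int start;
        final int end;
        final String text;

        Edit(int start, int end, String text) {
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    /** Applies non-overlapping edits; every character outside an edit is copied verbatim. */
    static String splice(String source, List<Edit> edits) {
        List<Edit> sorted = new ArrayList<Edit>(edits);
        sorted.sort(Comparator.comparingInt((Edit e) -> e.start));

        StringBuilder out = new StringBuilder(source.length());
        int cursor = 0;
        for (Edit e : sorted) {
            if (e.start < cursor || e.end < e.start || e.end > source.length()) {
                throw new IllegalArgumentException("overlapping or out-of-range edit");
            }
            out.append(source, cursor, e.start); // untouched characters, copied as-is
            out.append(e.text);                  // the translated span
            cursor = e.end;
        }
        out.append(source, cursor, source.length()); // remainder of the document
        return out.toString();
    }

    public static void main(String[] args) {
        String source = "<child title=\"translatable attribute\" foo='non translatable attr'>";
        String oldValue = "translatable attribute";
        int start = source.indexOf(oldValue);

        List<Edit> edits = new ArrayList<Edit>();
        edits.add(new Edit(start, start + oldValue.length(), "attribut traduisible"));

        System.out.println(splice(source, edits));
    }
}
```

With this shape, quote style, attribute order, comments and CDATA markers survive simply because they are never re-serialized.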
XML Input:
<item>
<!-- Comment for title -->
<title>Title of Feed Item</title>
<link>/mylink/article1</link>
<description>
<![CDATA[
<p>Paragraph of text describing the article to be displayed</p>
]]>
</description>
<!-- Comment for nested item -->
<parent>
<child title="translatable attribute" foo='non translatable attr'>
Translatable text
</child>
</parent>
</item>
StAX parser code:
@Test
public void testXmlParser() throws IOException, XMLStreamException {
    // Read the whole sample file so it can be compared with the writer's output
    String xmlSource = IOUtils.toString(new FileInputStream("testsamples/example.xml"), "UTF-8");

    XMLInputFactory factory = XMLInputFactory.newInstance();
    XMLEventReader eventReader =
            factory.createXMLEventReader(new StringReader(xmlSource));

    Writer outputWriter = new StringWriter();
    XMLOutputFactory xmlOutputFactory = XMLOutputFactory.newInstance();
    XMLEventWriter xmlEventWriter = xmlOutputFactory
            .createXMLEventWriter(outputWriter);

    // Copy every event straight through, with no translation applied yet
    while (eventReader.hasNext()) {
        XMLEvent event = eventReader.nextEvent();
        xmlEventWriter.add(event);
    }
    xmlEventWriter.flush();

    // This assertion fails: the written document differs from the input
    assertEquals(xmlSource, outputWriter.toString());
}
Output of StAX event writer:
<?xml version="1.0" ?><item>
<!-- Comment for title -->
<title>Title of Feed Item</title>
<link>/mylink/article1</link>
<description>
<p>Paragraph of text describing the article to be displayed</p>
</description>
<!-- Comment for nested item -->
<parent>
<child foo="non translatable attr" title="translatable attribute">
Translatable text
</child>
</parent>
</item>
As you can see, the output differs from the input in four ways: it includes an XML declaration that was not in the input, it has removed the CDATA section, it has reordered the attributes of the child element, and it has replaced the single quotes with double quotes. Is there a Java library that will do what I want, or should I write my own?
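For reference, if I do end up writing my own, a rough starting point for the attribute case might look like the sketch below. It is only an illustration under strong assumptions: it uses a regular expression rather than a real parser, so it would also match `title=` inside comments or CDATA sections, and it does not XML-escape the translated value. The class and method names are hypothetical.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeTranslationSketch {

    // Matches title="..." or title='...' and remembers which quote character was used.
    // Group 1: name and '=' with original spacing, group 2: quote, group 3: value.
    private static final Pattern TITLE_ATTR =
            Pattern.compile("(\\btitle\\s*=\\s*)([\"'])(.*?)\\2", Pattern.DOTALL);

    /** Rewrites only the value of each title attribute; all other text is left as-is. */
    static String translateTitleAttributes(String xml, Function<String, String> translate) {
        Matcher m = TITLE_ATTR.matcher(xml);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement =
                    m.group(1) + m.group(2) + translate.apply(m.group(3)) + m.group(2);
            // quoteReplacement keeps '$' and '\' in the translation literal
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy everything after the last match unchanged
        return out.toString();
    }

    public static void main(String[] args) {
        String xml = "<child title=\"translatable attribute\" foo='non translatable attr'>";
        // Upper-casing stands in for a real translation rule
        System.out.println(translateTitleAttributes(xml, String::toUpperCase));
    }
}
```

Unlike the StAX round trip, this keeps the attribute order and the original quote characters, but it is too fragile for anything beyond a proof of concept.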