
My requirements are:

  • Receive XML document from client
  • Translate certain XML elements and attributes (according to predefined rules)
  • Write out translated XML document
  • Return XML document to client

The XML document must not be modified in any way other than the desired translations. This is a requirement of the client: their XML files are edited by hand, and the people editing them expect the formatting to stay exactly as they left it.

Is there an XML parser that will do this? Here is a simple example that uses the StAX parser but does not preserve some parts of the input XML:

XML Input:

<item>
  <!-- Comment for title -->
  <title>Title of Feed Item</title>
  <link>/mylink/article1</link>
  <description>
    <![CDATA[
      <p>Paragraph of text describing the article to be displayed</p>
    ]]>
  </description>
  <!-- Comment for nested item -->
  <parent>
    <child title="translatable attribute" foo='non translatable attr'>
      Translatable text
    </child>
  </parent>
</item>

StAX parser code:

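// Assumes JUnit 4 and Apache Commons IO on the classpath.
import static org.junit.Assert.assertEquals;

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
import org.apache.commons.io.IOUtils;
import org.junit.Test;

@Test
public void testXmlParser() throws IOException, XMLStreamException {
    String xmlSource;
    try (InputStream in = new FileInputStream("testsamples/example.xml")) {
        xmlSource = IOUtils.toString(in, "UTF-8");
    }

    XMLEventReader eventReader = XMLInputFactory.newInstance()
            .createXMLEventReader(new StringReader(xmlSource));

    Writer outputWriter = new StringWriter();
    XMLEventWriter xmlEventWriter = XMLOutputFactory.newInstance()
            .createXMLEventWriter(outputWriter);

    // Copy every event from the reader to the writer unchanged.
    while (eventReader.hasNext()) {
        XMLEvent event = eventReader.nextEvent();
        xmlEventWriter.add(event);
    }
    xmlEventWriter.flush(); // make sure buffered output reaches the StringWriter

    // This assertion fails: the round trip does not reproduce the input text.
    assertEquals(xmlSource, outputWriter.toString());
}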

Output of StAX event writer:

<?xml version="1.0" ?><item>
  <!-- Comment for title -->
  <title>Title of Feed Item</title>
  <link>/mylink/article1</link>
  <description>

      &lt;p&gt;Paragraph of text describing the article to be displayed&lt;/p&gt;

  </description>
  <!-- Comment for nested item -->
  <parent>
    <child foo="non translatable attr" title="translatable attribute">
      Translatable text
    </child>
  </parent>
</item>

As you can see, the output differs from the input in four ways: it adds an XML declaration that was not in the input, it replaces the CDATA section with escaped text, it reorders the attributes of the child element, and it replaces the single quotes around the foo attribute value with double quotes. Is there a Java library that will do what I want, or should I write my own?

Pretty much write your own and remind the client that you wouldn't have billed all this additional and useless work if they had listened to the whole world telling them how to do XML or standardized formats in general. Now would have been a good time for them to get back to sanity, but instead they wanted to pay you to join the insanity. - kumesana
@Kumesana Yes I can see how you might think it's a stupid requirement. Here's another example: You want to write an XML text editor that does syntax highlighting. Obviously your editor should never make changes to the document that the user did not ask for. How do you parse the location of the elements, attributes and so on in order to highlight them with different colours? - Alex Spurling
Text editors are kinda supposed to work with themselves and maintain their own standards for tied-to-syntax highlighting. (Besides, in the real world, text editors suck and handle their syntax highlighting with extended regex rules, which doesn't cover all possibilities and you can always write a correct program they fail to highlight) - kumesana

1 Answer


No, there is no such parser that I am aware of. There might be internal parsers embedded in XML editing tools, but I think they are too tightly coupled to be of general use.
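If only a handful of attribute values need translating, one fallback is to skip the parser for the write step and splice new values into the original text, so every untouched byte stays as it was. The following is a rough sketch of that idea, not a general solution: the class and method names (`AttributeSplice`, `translateTitleAttributes`) are made up for illustration, the `title` attribute comes from the question's sample, and the regular expression will also match `title=...` inside comments, CDATA sections and text content, which a real implementation would have to rule out.

```java
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: replaces only the value of title attributes in raw XML text.
class AttributeSplice {
    // Group 2 captures the opening quote so the same quote must close the value.
    // Group 3 is the attribute value. The lookbehind avoids matching names such as sub-title.
    private static final Pattern TITLE_ATTR = Pattern.compile(
            "(?<![\\w:.-])(title\\s*=\\s*)([\"'])(.*?)\\2", Pattern.DOTALL);

    static String translateTitleAttributes(String xml, UnaryOperator<String> translate) {
        Matcher m = TITLE_ATTR.matcher(xml);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy everything up to the value verbatim, including name, spacing and quote.
            out.append(xml, last, m.start(3));
            // The caller is responsible for XML-escaping the translated value.
            out.append(translate.apply(m.group(3)));
            last = m.end(3);
        }
        // Copy the rest of the document verbatim.
        out.append(xml, last, xml.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String in = "<child title=\"translatable attribute\" foo='non translatable attr'>";
        // Upper-casing stands in for a real translation rule.
        System.out.println(translateTitleAttributes(in, String::toUpperCase));
    }
}
```

Because only the characters between the quotes are replaced, attribute order, quote style and surrounding whitespace are taken from the input unchanged. The price is that the code no longer understands XML structure, which is exactly the layering problem described next.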

You are not supposed to care whether attributes are delimited by single or double quotes, or whether there is whitespace around the "=" sign, or whether the 1-bits in the UTF-8 encoding are represented by a positive or negative voltage, so the parser doesn't tell you. If you do care, then you are probably doing things wrong: successful software engineering depends on understanding the layers of abstraction you are working with.
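To make that concrete, here is a small sketch using the standard StAX cursor API (`XMLStreamReader`). The class and method names (`QuoteStyleDemo`, `firstElementAttributes`) are made up for illustration. It reads the attributes of the first element from two inputs that differ only in quote style and in whitespace around the "=" sign; the reader hands back names and values, and nothing about how they were written.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: shows what a StAX reader exposes for attributes.
class QuoteStyleDemo {
    // Returns the attributes of the first start element as name -> value.
    static Map<String, String> firstElementAttributes(String xml) {
        Map<String, String> attrs = new LinkedHashMap<>();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        for (int i = 0; i < reader.getAttributeCount(); i++) {
                            attrs.put(reader.getAttributeLocalName(i),
                                      reader.getAttributeValue(i));
                        }
                        break;
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML input", e);
        }
        return attrs;
    }

    public static void main(String[] args) {
        String doubleQuoted =
                "<child title=\"translatable attribute\" foo=\"non translatable attr\"/>";
        String singleQuoted =
                "<child title = 'translatable attribute' foo='non translatable attr'/>";
        // Both inputs yield the same name/value pairs; the quote style is gone.
        System.out.println(
                firstElementAttributes(doubleQuoted).equals(firstElementAttributes(singleQuoted)));
    }
}
```

The API has accessors for the attribute name, namespace, type and value, but none for the delimiter or the original spacing, so a writer fed from these events has to pick its own.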

PS: managing clients who try to impose bad engineering on you is one of those important IT skills that never appears in a CV...