7
votes

I want to write a method for a Java class. The method accepts as input a string of XML data as given below.

<?xml version="1.0" encoding="UTF-8"?>
<library>

    <book>
        <name> <> Programming in ANSI C <> </name>
        <author> <>  Balaguruswamy <> </author>
        <comment> <> This comment may contain xml entities such as &, < and >. <> </comment>
    </book>

    <book>
        <name> <> A Mathematical Theory of Communication <> </name>
        <author> <> Claude E. Shannon <> </author>
        <comment> <> This comment also may contain xml entities. <> </comment>
    </book>

    <!-- This library contains more than ten thousand books. -->
</library>

The XML string contains many substrings that start and end with <>. A substring may contain XML special characters such as >, <, &, ' and ". The method needs to replace them with &gt;, &lt;, &amp;, &apos; and &quot; respectively.

Is there any regular-expression method in Java to accomplish this task?

2
Are you asking to escape all XML, or just the <> that happens in between tags? - Justin Pihony
Who is generating the XML? It would seem that the correct way to fix the problem would be to output valid XML as opposed to tinkering with the contents. - pimaster
The substrings are taken from a database. Since the XML string may contain more than thirty thousand substrings, it would be inefficient to escape all XML entities before adding them to the XML string. That is why we just introduce the <> markers and make the method responsible for escaping the XML entities before the string is used. - Mohammed H

2 Answers

3
votes

Is this data being passed to you, or can you control it? If you can control it, I would suggest using a CDATA block. If you are really unsure about the data being entered into the XML blocks, just wrap everything in a CDATA section before it is saved to the DB.
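As a rough illustration of the CDATA suggestion, here is a minimal sketch. The class and method names (`CdataWrapper`, `wrapInCdata`) are made up for this example. The one thing to watch is that the text itself may contain `]]>`, which would end the section early, so the sketch splits that sequence across two CDATA sections.

```java
public class CdataWrapper {

    /**
     * Wraps raw text in a CDATA section (hypothetical helper).
     * Any "]]>" inside the text is split across two sections so it
     * cannot terminate the CDATA block prematurely.
     */
    public static String wrapInCdata(String raw) {
        return "<![CDATA[" + raw.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    public static void main(String[] args) {
        // prints <comment><![CDATA[a < b && c > d]]></comment>
        System.out.println("<comment>" + wrapInCdata("a < b && c > d") + "</comment>");
    }
}
```

Inside a CDATA section, `<`, `>` and `&` are taken literally, so no entity escaping is needed for the wrapped text.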

If you do not have control over this, then as far as I know, this will take a fair amount of coding due to the number of edge cases you may have to deal with. It is not something that a simple regex will be able to handle (whether a valid block is starting, whether one is ending, whether one has already ended, etc.).

Here is a very basic regex for the <> case, but I really believe the rest gets extremely complicated:

<>(.*?)<> //For <> changes: captures the text between a pair of <> markers
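For the simple case where every value sits between a pair of `<>` markers and never contains `<>` itself, the replacement could be sketched with `java.util.regex` as below. The names `MarkerEscaper`, `escapeXml` and `escapeMarked` are invented for this example, and the pattern is an assumption about how the markers are used, not a general solution to the edge cases mentioned above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MarkerEscaper {

    // Text between a pair of <> markers; DOTALL lets a value span several lines.
    private static final Pattern MARKED = Pattern.compile("<>(.*?)<>", Pattern.DOTALL);

    /** Escapes the five XML special characters in one pass, so nothing is escaped twice. */
    public static String escapeXml(String s) {
        StringBuilder sb = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '\'': sb.append("&apos;"); break;
                case '"':  sb.append("&quot;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Replaces each <>...<> span with its escaped content and drops the markers. */
    public static String escapeMarked(String xml) {
        Matcher m = MARKED.matcher(xml);
        StringBuffer out = new StringBuffer(xml.length());
        while (m.find()) {
            // quoteReplacement stops '$' and '\' in the data from being read as group references
            m.appendReplacement(out, Matcher.quoteReplacement(escapeXml(m.group(1))));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeMarked("<name><>Tom & Jerry<></name>"));
    }
}
```

Whitespace between the markers and the value is kept as is. If a value can itself contain `<>`, this pairing breaks and a different delimiter would be needed.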
2
votes

You can follow these steps, for example:

  1. Read the XML file with DOM or SAX
  2. Replace the strings with a regular expression
  3. Write the XML file with DOM or SAX
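The three steps above might look roughly like this with the JDK's built-in DOM parser and `Transformer`. This is only a sketch: `DomRoundTrip` and `replaceInElements` are hypothetical names, and it assumes the input is already well-formed XML, because the parser rejects anything else (including a string that still holds raw `<>` markers or unescaped `&`). It also assumes the targeted elements hold text only, since `setTextContent` replaces all child nodes.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class DomRoundTrip {

    /** Parses xml, applies a regex replacement to the text of every element named tag, and serializes it. */
    public static String replaceInElements(String xml, String tag, String regex, String replacement) {
        try {
            // 1. Read: fails if the input is not well-formed
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // 2. Replace: operate on the decoded text, not on raw markup
            NodeList nodes = doc.getElementsByTagName(tag);
            for (int i = 0; i < nodes.getLength(); i++) {
                String text = nodes.item(i).getTextContent();
                nodes.item(i).setTextContent(text.replaceAll(regex, replacement));
            }

            // 3. Write: the serializer escapes '<' and '&' in text nodes on output
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse or serialize the XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<library><book><name>ANSI C</name></book></library>";
        System.out.println(replaceInElements(xml, "name", "ANSI C", "C & C++"));
    }
}
```

The point of going through DOM is that escaping happens at serialization time, so the replacement text can be written as plain characters.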