32
votes

I have an XML file stored as a DOM Document and I would like to pretty print it to the console, preferably without using an external library. I am aware that this question has been asked multiple times on this site, however none of the previous answers have worked for me. I am using java 8, so perhaps this is where my code differs from previous questions? I have also tried to set the transformer manually using code found from the web, however this just caused a not found error.

Here is my code which currently just outputs each xml element on a new line to the left of the console.

import java.io.*;
import javax.xml.parsers.*;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;


public class Test {
    public Test(){
        try {
            //java.lang.System.setProperty("javax.xml.transform.TransformerFactory", "org.apache.xalan.xsltc.trax.TransformerFactoryImpl");

            DocumentBuilderFactory dbFactory;
            DocumentBuilder dBuilder;
            Document original = null;
            try {
                dbFactory = DocumentBuilderFactory.newInstance();
                dBuilder = dbFactory.newDocumentBuilder();
                original = dBuilder.parse(new InputSource(new InputStreamReader(new FileInputStream("xml Store - Copy.xml"))));
            } catch (SAXException | IOException | ParserConfigurationException e) {
                e.printStackTrace();
            }
            StringWriter stringWriter = new StringWriter();
            StreamResult xmlOutput = new StreamResult(stringWriter);
            TransformerFactory tf = TransformerFactory.newInstance();
            //tf.setAttribute("indent-number", 2);
            Transformer transformer = tf.newTransformer();
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            transformer.transform(new DOMSource(original), xmlOutput);
            java.lang.System.out.println(xmlOutput.getWriter().toString());
        } catch (Exception ex) {
            throw new RuntimeException("Error converting to String", ex);
        }
    }

    public static void main(String[] args){
        new Test();
    }

}
7

7 Answers

55
votes

In reply to Espinosa's comment, here is a solution when "the original xml is not already (partially) indented or contain new lines".

Background

Excerpt from the article (see References below) inspiring this solution:

Based on the DOM specification, whitespaces outside the tags are perfectly valid and they are properly preserved. To remove them, we can use XPath’s normalize-space to locate all the whitespace nodes and remove them first.

Java Code

public static String toPrettyString(String xml, int indent) {
    try {
        // Turn xml string into a document
        Document document = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new ByteArrayInputStream(xml.getBytes("utf-8"))));

        // Remove whitespaces outside tags
        document.normalize();
        XPath xPath = XPathFactory.newInstance().newXPath();
        NodeList nodeList = (NodeList) xPath.evaluate("//text()[normalize-space()='']",
                                                      document,
                                                      XPathConstants.NODESET);

        for (int i = 0; i < nodeList.getLength(); ++i) {
            Node node = nodeList.item(i);
            node.getParentNode().removeChild(node);
        }

        // Setup pretty print options
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        transformerFactory.setAttribute("indent-number", indent);
        Transformer transformer = transformerFactory.newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");

        // Return pretty print xml string
        StringWriter stringWriter = new StringWriter();
        transformer.transform(new DOMSource(document), new StreamResult(stringWriter));
        return stringWriter.toString();
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

Sample usage

String xml = "<root>" + //
             "\n   "  + //
             "\n<name>Coco Puff</name>" + //
             "\n        <total>10</total>    </root>";

System.out.println(toPrettyString(xml, 4));

Output

<root>
    <name>Coco Puff</name>
    <total>10</total>
</root>

References

10
votes

I guess that the problem is related to blank text nodes (i.e. text nodes with only whitespaces) in the original file. You should try to programmatically remove them just after the parsing, using the following code. If you don't remove them, the Transformer is going to preserve them.

original.getDocumentElement().normalize();
XPathExpression xpath = XPathFactory.newInstance().newXPath().compile("//text()[normalize-space(.) = '']");
NodeList blankTextNodes = (NodeList) xpath.evaluate(original, XPathConstants.NODESET);

for (int i = 0; i < blankTextNodes.getLength(); i++) {
     blankTextNodes.item(i).getParentNode().removeChild(blankTextNodes.item(i));
}
5
votes

This works on Java 8:

public static void main (String[] args) throws Exception {
    String xmlString = "<hello><from>ME</from></hello>";
    DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
    Document document = documentBuilder.parse(new InputSource(new StringReader(xmlString)));
    pretty(document, System.out, 2);
}

private static void pretty(Document document, OutputStream outputStream, int indent) throws Exception {
    TransformerFactory transformerFactory = TransformerFactory.newInstance();
    Transformer transformer = transformerFactory.newTransformer();
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    if (indent > 0) {
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", Integer.toString(indent));
    }
    Result result = new StreamResult(outputStream);
    Source source = new DOMSource(document);
    transformer.transform(source, result);
}
2
votes

I've written a simple class for for removing whitespace in documents - supports command-line and does not use DOM / XPath.

Edit: Come to think of it, the project also contains a pretty-printer which handles existing whitespace:

PrettyPrinter prettyPrinter = PrettyPrinterBuilder.newPrettyPrinter().ignoreWhitespace().build();
1
votes

Underscore-java has static method U.formatXml(string). I am the maintainer of the project. Live example

import com.github.underscore.lodash.U;

public class MyClass {
    public static void main(String args[]) {
        String xml = "<root>" + //
             "\n   "  + //
             "\n<name>Coco Puff</name>" + //
             "\n        <total>10</total>    </root>";

        System.out.println(U.formatXml(xml));
    }
}

Output:

<root>
   <name>Coco Puff</name>
   <total>10</total>
</root>
0
votes

I didn't like any of the common XML formatting solutions because they all remove more than 1 consecutive new line character (for some reason, removing spaces/tabs and removing new line characters are inseparable...). Here's my solution, which was actually made for XHTML but should do the job with XML as well:

public String GenerateTabs(int tabLevel) {
  char[] tabs = new char[tabLevel * 2];
  Arrays.fill(tabs, ' ');

  //Or:
  //char[] tabs = new char[tabLevel];
  //Arrays.fill(tabs, '\t');

  return new String(tabs);
}

public String FormatXHTMLCode(String code) {
  // Split on new lines.
  String[] splitLines = code.split("\\n", 0);

  int tabLevel = 0;

  // Go through each line.
  for (int lineNum = 0; lineNum < splitLines.length; ++lineNum) {
    String currentLine = splitLines[lineNum];

    if (currentLine.trim().isEmpty()) {
      splitLines[lineNum] = "";
    } else if (currentLine.matches(".*<[^/!][^<>]+?(?<!/)>?")) {
      splitLines[lineNum] = GenerateTabs(tabLevel) + splitLines[lineNum];

      ++tabLevel;
    } else if (currentLine.matches(".*</[^<>]+?>")) {
      --tabLevel;

      if (tabLevel < 0) {
        tabLevel = 0;
      }

      splitLines[lineNum] = GenerateTabs(tabLevel) + splitLines[lineNum];
    } else if (currentLine.matches("[^<>]*?/>")) {
      splitLines[lineNum] = GenerateTabs(tabLevel) + splitLines[lineNum];

      --tabLevel;

      if (tabLevel < 0) {
        tabLevel = 0;
      }
    } else {
      splitLines[lineNum] = GenerateTabs(tabLevel) + splitLines[lineNum];
    }
  }

  return String.join("\n", splitLines);
}

It makes one assumption: that there are no <> characters except for those that comprise the XML/XHTML tags.

-3
votes

Create xml file :

new FileInputStream("xml Store - Copy.xml") ;// result xml file format incorrect ! 

so that, when parse the content of the given input source as an XML document and return a new DOM object.

Document original = null;
...
original.parse("data.xml");//input source as an XML document