0 votes

I have one really large input file containing XML data.

When I put that file in HDFS, it will be divided into blocks, so XML records will end up split across block boundaries. The typical TextInputFormat handles this scenario by having each reader skip a partial first line (when its split does not begin at a line boundary), while the previous mapper reads past the end of its own split (over RPC into the next block) until the end of the record.

How can this scenario be handled in the XML case? I don't want to use a WholeFileInputFormat, as that would give up the parallelism.

<books>
<book>
<author>Test</author>
<title>Hadoop Recipes</title>
<ISBN>04567GHFR</ISBN>
</book>
<book>
<author>Test</author>
<title>Hadoop Data</title>
<ISBN>04567ABCD</ISBN>
</book>
<book>
<author>Test1</author>
<title>C++</title>
<ISBN>FTYU9876</ISBN>
</book>
<book>
<author>Test1</author>
<title>Baby Tips</title>
<ISBN>ANBMKO09</ISBN>
</book>
</books>

The initialize function of my XMLRecordReader looks like this:

public void initialize(InputSplit arg0, TaskAttemptContext arg1)
            throws IOException, InterruptedException {

        Configuration conf = arg1.getConfiguration();

        FileSplit split = (FileSplit) arg0;
        start = split.getStart();

        end = start + split.getLength();
        final Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        fsin = fs.open(file);
        fsin.seek(start);

        DocumentBuilder db;
        try {
            db = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            // Fail fast instead of swallowing the exception and hitting
            // an NPE on db below.
            throw new IOException("Could not create DOM parser", e);
        }

        Document doc;
        try {
            // Parses the stream from the split start onward; this fails
            // whenever the split does not begin a well-formed document.
            doc = db.parse(fsin);
        } catch (SAXException e) {
            throw new IOException("Could not parse split as XML", e);
        }
        NodeList nodes = doc.getElementsByTagName("book");

        for (int i = 0; i < nodes.getLength(); i++) {
            Element element = (Element) nodes.item(i);
            BookWritable book = new BookWritable();

            NodeList author = element.getElementsByTagName("author");
            Element line = (Element) author.item(0);
            book.setBookAuthor(new Text(getCharacterDataFromElement(line)));

            NodeList title = element.getElementsByTagName("title");
            line = (Element) title.item(0);
            book.setBookTitle(new Text(getCharacterDataFromElement(line)));

            NodeList isbn = element.getElementsByTagName("ISBN");
            line = (Element) isbn.item(0);
            book.setBookISBN(new Text(getCharacterDataFromElement(line)));

            mapBooks.put(Long.valueOf(i), book);
        }
        this.startPos = 0;
        endPos = mapBooks.size();
    }

I am using a DOM parser for the XML parsing part. I am not sure, but perhaps if I do a pattern match instead, the DOM parsing issue (broken XML in one of the splits) would be resolved; but would that also let a mapper complete its last record by reading from the next input split?

Please correct me if there is some fundamental issue here; if any solution exists, it would be a great help.

Thanks, AJ

XML in HDFS is not that efficient; it has proved more performant to convert XML to CSV or Avro with a script or standalone program and then upload that into HDFS. Converting about 25 GB of XML might take a couple of minutes. – alexeipab

3 Answers

0 votes

You could very well try out Mahout's XmlInputFormat class. There is more explanation in the book 'Hadoop in Action'.
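If I recall correctly, job setup looks roughly like the sketch below. The "xmlinput.start" and "xmlinput.end" keys tell the record reader which tags delimit a record; the package of XmlInputFormat has moved between Mahout releases, so treat the exact import as an assumption to verify against your version:

```java
// Assumes Mahout's XmlInputFormat is on the classpath; adjust the
// import to match your Mahout release.
Configuration conf = new Configuration();
conf.set("xmlinput.start", "<book>");
conf.set("xmlinput.end", "</book>");

Job job = Job.getInstance(conf, "xml-books");
job.setInputFormatClass(XmlInputFormat.class);
// Each map() call then receives one complete <book>...</book> element
// as its value, even when the element crosses an HDFS block boundary.
```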

0 votes

I don't think an XML file is splittable by itself, so I don't think there is a generic, off-the-shelf solution for you. The problem is that there is no way to understand the tag hierarchy when starting in the middle of the XML, unless you know the structure of the XML a priori.

But your XML is very simple, so you can create an ad-hoc splitter. As you explained, TextInputFormat skips characters until it reaches the beginning of a new line. Well, you can do the same thing, looking for the book tag instead of a new line: copy that code, but instead of looking for the "\n" character, look for the opening tag of your items.
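That boundary rule can be sketched without any Hadoop types. The class and method names below (BookSplitReader, readSplit) are invented for illustration, and it works on an in-memory byte array rather than an FSDataInputStream; it also compares character offsets, which only equal byte offsets for single-byte text:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BookSplitReader {
    private static final String OPEN = "<book>";
    private static final String CLOSE = "</book>";

    /**
     * Returns the records whose opening tag falls inside [start, end).
     * Mirrors the TextInputFormat rule: a record belongs to the split it
     * starts in, and the reader may run past `end` to finish the last one.
     */
    public static List<String> readSplit(byte[] file, long start, long end) {
        String data = new String(file, StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        // Like skipping a partial first line: advance to the first <book>
        // at or after the split start; anything before it belongs to the
        // previous split's reader.
        int pos = data.indexOf(OPEN, (int) start);
        while (pos >= 0 && pos < end) {
            int close = data.indexOf(CLOSE, pos);
            if (close < 0) break; // malformed tail; stop
            records.add(data.substring(pos, close + CLOSE.length()));
            pos = data.indexOf(OPEN, close + CLOSE.length());
        }
        return records;
    }
}
```

Each record is emitted by exactly one split (the one its opening tag starts in), which is the same contract LineRecordReader gives you for text lines.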

Be sure to use a SAX parser in your development; DOM is not a good option for dealing with big XMLs. With a SAX parser you read the tags one by one and take an action on each event, instead of loading the whole file into memory as in the case of DOM tree generation.
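A minimal SAX handler for the book documents above might look like this; it uses only the JDK's built-in javax.xml.parsers and org.xml.sax packages, and the class names (BookSaxDemo, BookHandler) are invented for the example:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class BookSaxDemo {

    /** Collects one "author|title|ISBN" string per <book> element. */
    static class BookHandler extends DefaultHandler {
        final List<String> books = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private String author, title, isbn;

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) {
            text.setLength(0); // reset the character buffer at each opening tag
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may arrive in several chunks
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            switch (qName) {
                case "author": author = text.toString().trim(); break;
                case "title":  title  = text.toString().trim(); break;
                case "ISBN":   isbn   = text.toString().trim(); break;
                case "book":   books.add(author + "|" + title + "|" + isbn); break;
            }
        }
    }

    public static List<String> parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        BookHandler handler = new BookHandler();
        parser.parse(new ByteArrayInputStream(
                xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.books;
    }
}
```

Only the current element's text and the three fields of the book in progress are held in memory at any moment, no matter how large the input is, which is exactly the property that matters for a RecordReader.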

0 votes

Maybe split the XML file first. There are open-source XML splitters, and also at least two commercial split tools that claim to handle the XML structure automatically so that each split file is well-formed XML. Google "xml split tool" or "xml splitter".