0 votes

I have one really large input file containing XML data.

When I put that file in HDFS, it will be divided into blocks, so XML records will end up split across block boundaries. The typical TextInputFormat handles this scenario by having each reader skip a partial first line (when its split does not begin at a line boundary), while the previous mapper reads past the end of its own split (over RPC into the next block) until the end of the record.

How can this scenario be handled in the XML case? I don't want to use a WholeFileInputFormat, as that would give up the parallelism.

<books>
<book>
<author>Test</author>
<title>Hadoop Recipes</title>
<ISBN>04567GHFR</ISBN>
</book>
<book>
<author>Test</author>
<title>Hadoop Data</title>
<ISBN>04567ABCD</ISBN>
</book>
<book>
<author>Test1</author>
<title>C++</title>
<ISBN>FTYU9876</ISBN>
</book>
<book>
<author>Test1</author>
<title>Baby Tips</title>
<ISBN>ANBMKO09</ISBN>
</book>
</books>

The initialize function of my XMLRecordReader looks like this:

public void initialize(InputSplit arg0, TaskAttemptContext arg1)
            throws IOException, InterruptedException {

        Configuration conf = arg1.getConfiguration();

        FileSplit split = (FileSplit) arg0;
        start = split.getStart();

        end = start + split.getLength();
        final Path file = split.getPath();
        FileSystem fs = file.getFileSystem(conf);
        fsin = fs.open(file);
        fsin.seek(start);

        DocumentBuilder db;
        try {
            db = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            // Fail fast instead of swallowing the exception and hitting
            // an NPE on db below.
            throw new IOException("Could not create DOM parser", e);
        }

        Document doc;
        try {
            // Parses the stream from the split start onward; this fails
            // whenever the split does not begin a well-formed document.
            doc = db.parse(fsin);
        } catch (SAXException e) {
            throw new IOException("Could not parse split as XML", e);
        }
        NodeList nodes = doc.getElementsByTagName("book");

        for (int i = 0; i < nodes.getLength(); i++) {
            Element element = (Element) nodes.item(i);
            BookWritable book = new BookWritable();

            NodeList author = element.getElementsByTagName("author");
            Element line = (Element) author.item(0);
            book.setBookAuthor(new Text(getCharacterDataFromElement(line)));

            NodeList title = element.getElementsByTagName("title");
            line = (Element) title.item(0);
            book.setBookTitle(new Text(getCharacterDataFromElement(line)));

            NodeList isbn = element.getElementsByTagName("ISBN");
            line = (Element) isbn.item(0);
            book.setBookISBN(new Text(getCharacterDataFromElement(line)));

            mapBooks.put(Long.valueOf(i), book);
        }
        this.startPos = 0;
        endPos = mapBooks.size();
    }

I am using a DOM parser for the XML parsing part. I am not sure, but perhaps if I do a pattern match instead, the DOM parsing issue (broken XML in one of the splits) would be resolved; but would that also let a mapper complete its last record by reading from the next input split?

Please correct me if there is some fundamental issue here; if any solution exists, it would be a great help.

Thanks, AJ

XML in HDFS is not that efficient; it has proved more performant to convert XML to CSV or Avro with a script or standalone program and then upload that into HDFS. Converting about 25 GB of XML might take a couple of minutes. – alexeipab

3 Answers

0 votes

You could very well try out Mahout's XmlInputFormat class. There is more explanation in the book 'Hadoop in Action'.
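If I recall correctly, job setup looks roughly like the sketch below. The "xmlinput.start" and "xmlinput.end" keys tell the record reader which tags delimit a record; the package of XmlInputFormat has moved between Mahout releases, so treat the exact import as an assumption to verify against your version:

```java
// Assumes Mahout's XmlInputFormat is on the classpath; adjust the
// import to match your Mahout release.
Configuration conf = new Configuration();
conf.set("xmlinput.start", "<book>");
conf.set("xmlinput.end", "</book>");

Job job = Job.getInstance(conf, "xml-books");
job.setInputFormatClass(XmlInputFormat.class);
// Each map() call then receives one complete <book>...</book> element
// as its value, even when the element crosses an HDFS block boundary.
```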

0 votes

I don't think an XML file is splittable by itself, so I don't think there is a generic, off-the-shelf solution for you. The problem is that there is no way to understand the tag hierarchy when starting in the middle of the XML, unless you know the structure of the XML a priori.

But your XML is very simple, so you can create an ad-hoc splitter. As you explained, TextInputFormat skips characters until it reaches the beginning of a new line. Well, you can do the same thing, looking for the book tag instead of a new line: copy that code, but instead of looking for the "\n" character, look for the opening tag of your items.
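That boundary rule can be sketched without any Hadoop types. The class and method names below (BookSplitReader, readSplit) are invented for illustration, and it works on an in-memory byte array rather than an FSDataInputStream; it also compares character offsets, which only equal byte offsets for single-byte text:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class BookSplitReader {
    private static final String OPEN = "<book>";
    private static final String CLOSE = "</book>";

    /**
     * Returns the records whose opening tag falls inside [start, end).
     * Mirrors the TextInputFormat rule: a record belongs to the split it
     * starts in, and the reader may run past `end` to finish the last one.
     */
    public static List<String> readSplit(byte[] file, long start, long end) {
        String data = new String(file, StandardCharsets.UTF_8);
        List<String> records = new ArrayList<>();
        // Like skipping a partial first line: advance to the first <book>
        // at or after the split start; anything before it belongs to the
        // previous split's reader.
        int pos = data.indexOf(OPEN, (int) start);
        while (pos >= 0 && pos < end) {
            int close = data.indexOf(CLOSE, pos);
            if (close < 0) break; // malformed tail; stop
            records.add(data.substring(pos, close + CLOSE.length()));
            pos = data.indexOf(OPEN, close + CLOSE.length());
        }
        return records;
    }
}
```

Each record is emitted by exactly one split (the one its opening tag starts in), which is the same contract LineRecordReader gives you for text lines.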

Be sure to use a SAX parser in your development; DOM is not a good option for dealing with big XMLs. With a SAX parser you read the tags one by one and take an action on each event, instead of loading the whole file into memory as in the case of DOM tree generation.
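A minimal SAX handler for the book documents above might look like this; it uses only the JDK's built-in javax.xml.parsers and org.xml.sax packages, and the class names (BookSaxDemo, BookHandler) are invented for the example:

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class BookSaxDemo {

    /** Collects one "author|title|ISBN" string per <book> element. */
    static class BookHandler extends DefaultHandler {
        final List<String> books = new ArrayList<>();
        private final StringBuilder text = new StringBuilder();
        private String author, title, isbn;

        @Override
        public void startElement(String uri, String local, String qName,
                                 Attributes atts) {
            text.setLength(0); // reset the character buffer at each opening tag
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length); // may arrive in several chunks
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            switch (qName) {
                case "author": author = text.toString().trim(); break;
                case "title":  title  = text.toString().trim(); break;
                case "ISBN":   isbn   = text.toString().trim(); break;
                case "book":   books.add(author + "|" + title + "|" + isbn); break;
            }
        }
    }

    public static List<String> parse(String xml) throws Exception {
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        BookHandler handler = new BookHandler();
        parser.parse(new ByteArrayInputStream(
                xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.books;
    }
}
```

Only the current element's text and the three fields of the book in progress are held in memory at any moment, no matter how large the input is, which is exactly the property that matters for a RecordReader.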

0 votes

Maybe split the XML file first. There are open-source XML splitters, and also at least two commercial split tools that claim to handle the XML structure automatically so that each split file is well-formed XML. Google "xml split tool" or "xml splitter".