XML processing using Apache Flink

Question

I am new to Apache Flink and to distributed processing in general. I have already gone through the Flink quick setup guide and understand the basics of MapFunctions, but I couldn't find a concrete example of XML processing. I have read about Hadoop's XmlInputFormat, but I don't understand how to use it.

I have a large (100 MB) XML file in the following format:

<Class>
    <student>.....</student>
    <student>.....</student>
    .
    .
    .
    <student>.....</student>
</Class>

The Flink job would read the file from HDFS and start processing it (basically, iterate over all the student elements).

I want to know, in layman's terms, how I can process the XML and create a list of student objects.

A simple explanation would be much appreciated.

Fabian Hueske · Accepted Answer · 2016-10-24T22:26:51

Apache Mahout's XmlInputFormat for Apache Hadoop extracts the text between two tags (in your case probably <student> and </student>). Flink provides wrappers to use Hadoop InputFormats, e.g., via the readHadoopFile() method of ExecutionEnvironment.
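
To make "extracts the text between two tags" concrete, here is a minimal plain-Java sketch of the idea. It is not Mahout's implementation and does not use Flink or Hadoop at all; the class name TagRecordExtractor and the method extract are hypothetical names chosen for illustration. It scans a string for the start tag, then for the next end tag, and returns everything in between (tags included) as one record.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: a hypothetical helper, not part of Mahout, Hadoop, or Flink. */
class TagRecordExtractor {

    /**
     * Returns every substring that begins with startTag and ends with the
     * next endTag, in document order. Text outside those tags is ignored.
     */
    static List<String> extract(String xml, String startTag, String endTag) {
        List<String> records = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = xml.indexOf(startTag, from);
            if (start < 0) {
                break; // no further records
            }
            int end = xml.indexOf(endTag, start + startTag.length());
            if (end < 0) {
                break; // unterminated record: stop instead of guessing
            }
            end += endTag.length();
            records.add(xml.substring(start, end));
            from = end; // continue scanning after this record
        }
        return records;
    }
}
```

Calling TagRecordExtractor.extract(xml, "<student>", "</student>") on the file contents would yield one string per student element, even when an element spans several lines. Each of those strings can then be handed to an XML parser of your choice to build a student object. The real input format does the same kind of scan, but on splits of a file in HDFS rather than on an in-memory string.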

If you do not want to use the XmlInputFormat and your XML file is nicely formatted, i.e., each student record is on a single line, you can use Flink's regular TextInputFormat, which reads the file line by line. A subsequent FlatMap function can parse all student lines and filter out all others.
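
The per-line logic for that second approach could look roughly like the sketch below. It is plain Java with no Flink dependency, and it rests on assumptions: the Student class and the parseLine helper are hypothetical, and because the question does not show what is inside a student element, the sketch simply keeps the inner text as one string.

```java
import java.util.Optional;

/** Hypothetical record type; the real fields are not shown in the question. */
class Student {
    final String content; // raw text between <student> and </student>

    Student(String content) {
        this.content = content;
    }
}

/** Illustration only: parses one line of the file at a time. */
class StudentLineParser {
    private static final String OPEN = "<student>";
    private static final String CLOSE = "</student>";

    /**
     * Returns a Student if the line holds one complete student element,
     * or an empty Optional for any other line (e.g. <Class> or </Class>).
     */
    static Optional<Student> parseLine(String line) {
        String trimmed = line.trim();
        boolean isStudentLine = trimmed.startsWith(OPEN)
                && trimmed.endsWith(CLOSE)
                && trimmed.length() >= OPEN.length() + CLOSE.length();
        if (!isStudentLine) {
            return Optional.empty();
        }
        String inner = trimmed.substring(OPEN.length(), trimmed.length() - CLOSE.length());
        return Optional.of(new Student(inner));
    }
}
```

Inside a Flink FlatMapFunction<String, Student>, the flatMap method would call parseLine on each incoming line and pass the result to the Collector only when it is present, so non-student lines produce no output. This only works under the one-record-per-line assumption stated above; a student element that spans several lines would be silently dropped.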
