
So I have ten different files, where each file looks like this.

<DocID1>    <RDF Document>
<DocID2>    <RDF Document>
.
.
.
.
<DocID50000>    <RDF Document>

There are actually ~56,000 lines per file. Each line contains a document ID and an RDF document.

My objective is to pass each line into a mapper as the input key-value pair and to emit multiple output key-value pairs. In the reduce step, I will store these in a Hive table.

I have a couple of questions to get started, as I am completely new to RDF/XML files.

  1. How am I supposed to parse each line of the file so that each line is passed separately to a mapper?

  2. Is there an efficient way of controlling the size of the input for the mapper?


1 Answer


1- If you are using TextInputFormat, each call to the mapper automatically gets one line (one record) as the value. Convert this line into a String and do the desired processing. Alternatively, you could make use of the Hadoop Streaming API with StreamXmlRecordReader. You have to provide the start and end tags, and all the information sandwiched between them will be fed to the mapper (in your case, <DocID1> and <RDF Document>).
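
For the first approach, a minimal mapper might look like the sketch below. This is my own illustration, not code from the question: the class name RdfLineMapper is hypothetical, and it assumes the doc ID and the RDF document are separated by whitespace on each line.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // With TextInputFormat, the key is the byte offset of the line and the
    // value is the whole line. The delimiter below is an assumption.
    public class RdfLineMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Split on the first run of whitespace: doc ID, then RDF payload.
            String[] parts = line.toString().split("\\s+", 2);
            if (parts.length < 2) {
                return; // skip malformed lines
            }
            String docId = parts[0];
            String rdf = parts[1];

            // Emit one pair here for simplicity; you could instead emit one
            // pair per triple parsed out of the RDF document.
            context.write(new Text(docId), new Text(rdf));
        }
    }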

Usage:

hadoop jar hadoop-streaming.jar -inputreader "StreamXmlRecord,begin=DocID,end=RDF Document" ..... (rest of the command)
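
Note that StreamXmlRecordReader matches begin and end as plain strings in the input, so (as an assumption about your data) you would set them to the literal text that opens and closes each record, e.g. the opening tag of the doc ID and the closing tag of the RDF document.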

2- Why do you need that? Your goal is to feed one complete line to each mapper, and that is the job of the InputFormat you are using. If you still need to control it, you have to write custom code, and in this particular case it's going to be a bit tricky.
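
If you do end up needing that control, one standard alternative to fully custom code (my suggestion, not part of the original answer) is NLineInputFormat, which fixes the number of lines per split and therefore per mapper. A minimal driver sketch, with RdfJobDriver and the 1000-line figure as placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class RdfJobDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "rdf-to-hive");
            job.setJarByClass(RdfJobDriver.class);

            // NLineInputFormat builds splits out of a fixed number of lines
            // rather than a fixed number of bytes, so each mapper receives
            // at most N lines. Tune the figure to your documents.
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.addInputPath(job, new Path(args[0]));
            NLineInputFormat.setNumLinesPerSplit(job, 1000);

            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            // ... set mapper/reducer classes and output types as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }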