I have ten different files, and each file looks like this:
<DocID1> <RDF Document>
<DocID2> <RDF Document>
.
.
.
.
<DocID50000> <RDF Document>
There are actually ~56,000 lines per file. Each line contains a document ID followed by an RDF document.
My objective is to pass each document ID and its RDF document into a mapper as the input key-value pair, and to emit multiple key-value pairs as output. In the reduce step, I will store these in a Hive table.
I have a couple of questions about getting started, as I am completely new to RDF/XML files.
How am I supposed to parse each line of a file to get the document ID and the RDF document separately, so that I can pass them to a mapper?
Is there an efficient way of controlling the size of the input for the mapper?
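To make the first question concrete, here is a rough sketch of the kind of per-line split I have in mind. It is written in plain Python (for example, for use with Hadoop Streaming) and rests on assumptions I have not verified against my data: that the document ID is the first whitespace-delimited token, that it contains no whitespace itself, and that the whole RDF document sits on the same line. The names `split_record` and `map_lines` are made up for illustration.

```python
def split_record(line):
    """Split one input line into (doc_id, rdf_document).

    Assumption: the ID is the first whitespace-delimited token and
    everything after it is the RDF document. Returns None for blank
    lines or lines that have an ID but no document.
    """
    parts = line.rstrip("\r\n").split(None, 1)
    if len(parts) != 2:
        return None
    doc_id, rdf_document = parts
    return doc_id, rdf_document


def map_lines(lines):
    """Yield (doc_id, rdf_document) pairs, skipping malformed lines."""
    for line in lines:
        record = split_record(line)
        if record is None:
            continue
        yield record
```

In a streaming mapper, `map_lines(sys.stdin)` would supply the pairs, and the RDF string could then be handed to an RDF/XML parser before emitting the output key-value pairs. Whether splitting on the first whitespace is safe depends on how the IDs are actually formatted.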