
New to Hadoop - I am trying to read my HDFS file in chunks, for example 100 lines at a time, and then run a regression on the data using Apache's OLSMultipleLinearRegression in the mapper. I am using the code shown here to read in multiple lines: http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/

My mapper is defined as:

public void map(LongWritable key, Text value, Context context)
        throws java.io.IOException, InterruptedException {
    String lines = value.toString();
    String[] lineArr = lines.split("\n");
    int lcount = lineArr.length;
    System.out.println(lcount); // prints "1"
    context.write(new Text(Integer.toString(lcount)), new IntWritable(1));
}

My question is: why does lcount == 1 in the System.out.println output? My file is delimited by "\n" and I have set NLINESTOPROCESS = 3 in the record reader. My input file is formatted as:

y x1 x2 x3 x4 x5
y x1 x2 x3 x4 x5
y x1 x2 x3 x4 x5
...

I cannot perform my multiple regression if I am only reading one line at a time, as the regression API takes multiple data points. Thank you for any help.
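For context, once the mapper does receive a multi-line chunk, the lines still have to be parsed into the y vector and x matrix that OLSMultipleLinearRegression.newSampleData(y, x) expects. A minimal parsing sketch in plain Java (the class and method names here are hypothetical, and the Hadoop plumbing is omitted):

```java
// Parses whitespace-delimited "y x1 x2 x3 x4 x5" lines into the arrays
// that OLSMultipleLinearRegression.newSampleData(double[], double[][])
// expects: y values from the first column, x values from the rest.
public class RegressionInput {

    // Extracts the first column of each line as the y vector.
    public static double[] parseY(String chunk) {
        String[] lines = chunk.split("\\n"); // regex escape for newline
        double[] y = new double[lines.length];
        for (int i = 0; i < lines.length; i++) {
            y[i] = Double.parseDouble(lines[i].trim().split("\\s+")[0]);
        }
        return y;
    }

    // Extracts the remaining columns of each line as the x matrix.
    public static double[][] parseX(String chunk) {
        String[] lines = chunk.split("\\n");
        double[][] x = new double[lines.length][];
        for (int i = 0; i < lines.length; i++) {
            String[] tok = lines[i].trim().split("\\s+");
            x[i] = new double[tok.length - 1];
            for (int j = 1; j < tok.length; j++) {
                x[i][j - 1] = Double.parseDouble(tok[j]);
            }
        }
        return x;
    }
}
```

With a working multi-line record reader, map() could call these on value.toString() and feed the result to the regression.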

In Hadoop, data arrives at the mapper line by line if you use TextInputFormat as your input format class. If you need the whole file as a single record, you should use WholeFileInputFormat. - USB

1 Answer


String.split() takes a regular expression as its argument, so you should escape the newline for the regex engine:

String []lineArr = lines.split("\\n");
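For example, outside Hadoop, splitting a three-line string with the escaped pattern gives the expected count:

```java
// Demonstrates String.split() with a regex-escaped newline:
// the pattern "\\n" compiles to the regex \n, which matches a newline.
public class SplitDemo {
    public static void main(String[] args) {
        String lines = "y1 x1\ny2 x2\ny3 x3";
        String[] lineArr = lines.split("\\n");
        System.out.println(lineArr.length); // prints 3
    }
}
```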