
New to Hadoop - I am trying to read my HDFS file in chunks, for example 100 lines at a time, and then run a regression on the data using Apache's OLSMultipleLinearRegression in the mapper. I am using the code shown here to read in multiple lines: http://bigdatacircus.com/2012/08/01/wordcount-with-custom-record-reader-of-textinputformat/

My mapper is defined as:

public void map(LongWritable key, Text value, Context context)
        throws java.io.IOException, InterruptedException {
    String lines = value.toString();
    String[] lineArr = lines.split("\n");
    int lcount = lineArr.length;
    System.out.println(lcount); // prints "1"
    context.write(new Text(Integer.toString(lcount)), new IntWritable(1));
}

My question is: why does lcount == 1 in the System.out.println output? My file is delimited by "\n" and I have set NLINESTOPROCESS = 3 in the record reader. My input file is formatted as:

y x1 x2 x3 x4 x5
y x1 x2 x3 x4 x5
y x1 x2 x3 x4 x5
...

I cannot perform my multiple regression if I am only reading one line at a time, as the regression API takes multiple data points. Thank you for any help.
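For context, once the mapper does receive a multi-line chunk, the lines still have to be parsed into the y vector and x matrix that OLSMultipleLinearRegression.newSampleData(y, x) expects. A minimal parsing sketch in plain Java (the class and method names here are hypothetical, and the Hadoop plumbing is omitted):

```java
// Parses whitespace-delimited "y x1 x2 x3 x4 x5" lines into the arrays
// that OLSMultipleLinearRegression.newSampleData(double[], double[][])
// expects: y values from the first column, x values from the rest.
public class RegressionInput {

    // Extracts the first column of each line as the y vector.
    public static double[] parseY(String chunk) {
        String[] lines = chunk.split("\\n"); // regex escape for newline
        double[] y = new double[lines.length];
        for (int i = 0; i < lines.length; i++) {
            y[i] = Double.parseDouble(lines[i].trim().split("\\s+")[0]);
        }
        return y;
    }

    // Extracts the remaining columns of each line as the x matrix.
    public static double[][] parseX(String chunk) {
        String[] lines = chunk.split("\\n");
        double[][] x = new double[lines.length][];
        for (int i = 0; i < lines.length; i++) {
            String[] tok = lines[i].trim().split("\\s+");
            x[i] = new double[tok.length - 1];
            for (int j = 1; j < tok.length; j++) {
                x[i][j - 1] = Double.parseDouble(tok[j]);
            }
        }
        return x;
    }
}
```

With a working multi-line record reader, map() could call these on value.toString() and feed the result to the regression.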

In Hadoop, data arrives at the mapper line by line if you use TextInputFormat as your input format class. If you need the whole file as a single record, you should use WholeFileInputFormat. - USB

1 Answer


String.split() takes a regular expression as its argument, so you should escape the newline for the regex engine:

String []lineArr = lines.split("\\n");
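For example, outside Hadoop, splitting a three-line string with the escaped pattern gives the expected count:

```java
// Demonstrates String.split() with a regex-escaped newline:
// the pattern "\\n" compiles to the regex \n, which matches a newline.
public class SplitDemo {
    public static void main(String[] args) {
        String lines = "y1 x1\ny2 x2\ny3 x3";
        String[] lineArr = lines.split("\\n");
        System.out.println(lineArr.length); // prints 3
    }
}
```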