1
votes

I just started using Spark and am trying to run a logistic regression. I keep getting this error:

Caused by: java.lang.IllegalArgumentException: requirement failed: 
Dimensions mismatch when adding new sample. Expecting 21 but got 17.

I have 21 features, but I'm not sure what the 17 means here, or what to do about it. My code is here:

from pyspark.mllib.regression import LabeledPoint
from numpy import array

def isfloat(string):
    # True if the string can be parsed as a float
    try:
        float(string)
        return True
    except ValueError:
        return False

def parse_interaction(line):
    line_split = line.split(",")
    # leave_out = [1,2,3]
    clean_line_split = line_split[3:24]
    retention = 1.0
    if line_split[0] == '0.0':
        retention = 0.0
    return LabeledPoint(retention, array([float(i) for i in clean_line_split if isfloat(i)]))

training_data = raw_data.map(parse_interaction)

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from time import time

t0 = time()
logit_model = LogisticRegressionWithLBFGS.train(training_data)
tt = time() - t0

print("Classifier trained in {} seconds".format(round(tt, 3)))
2
Since you filter out values when creating the array, its length can be anywhere between 0 and the expected size. It would make more sense to drop malformed entries altogether. - zero323
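One way to act on this comment is sketched below: reject the whole row as soon as any feature column fails to parse, instead of silently shortening the vector. The helper name parse_or_none and its start/stop parameters are hypothetical; the column range 3:24 and the label rule are taken from the question's code. The sketch returns plain tuples so it can be tried without a Spark installation.

```python
def isfloat(string):
    # True if the string can be parsed as a float
    try:
        float(string)
        return True
    except ValueError:
        return False

def parse_or_none(line, start=3, stop=24):
    """Return (label, features), or None if the row is malformed.

    A row counts as malformed when it has fewer than (stop - start)
    feature columns or when any feature column is not a float.
    """
    fields = line.split(",")
    feature_fields = fields[start:stop]
    expected = stop - start
    if len(feature_fields) != expected:
        return None
    if not all(isfloat(f) for f in feature_fields):
        return None
    label = 0.0 if fields[0] == '0.0' else 1.0
    return label, [float(f) for f in feature_fields]

# Assumed usage with the question's raw_data RDD (not executed here):
# training_data = (raw_data.map(parse_or_none)
#                          .filter(lambda p: p is not None)
#                          .map(lambda p: LabeledPoint(p[0], p[1])))
```

With this approach every surviving row has exactly 21 features, at the cost of discarding rows that contain any unparseable value.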

2 Answers

0
votes

This looks like a problem with the raw data. My guess is that some of the values are not passing the isfloat check. Can you try printing the values to the console? That will help you identify the offending lines.
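A minimal sketch of that kind of inspection follows. The helper name diagnose is hypothetical; it reports how many feature columns a line has and which of them fail the float check, using the same 3:24 slice as the question.

```python
def isfloat(string):
    # True if the string can be parsed as a float
    try:
        float(string)
        return True
    except ValueError:
        return False

def diagnose(line, start=3, stop=24):
    """Return (number of feature columns found, [(column index, raw value), ...])
    where the list holds only the columns that are not parseable as floats."""
    fields = line.split(",")
    feature_fields = fields[start:stop]
    bad = [(i, v) for i, v in enumerate(feature_fields, start) if not isfloat(v)]
    return len(feature_fields), bad

# Assumed usage with the question's raw_data RDD (not executed here):
# suspicious = raw_data.filter(lambda l: diagnose(l) != (21, []))
# for line in suspicious.take(5):
#     print(diagnose(line))
```

Using take on the filtered RDD keeps the output small, which matters if the dataset is large.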

0
votes

The error comes from a dimension mismatch: for some rows, array is not getting all 21 values. I suggest setting a value to 0 when it is not a float, since that is (seemingly) what you want.
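The substitution suggested here could look roughly like the sketch below. The names to_float_or_default and parse_with_default are hypothetical, and padding short rows with the default is an extra assumption not stated in the answer. It returns plain tuples so it can be tried without Spark.

```python
def to_float_or_default(value, default=0.0):
    # Parse value as a float, falling back to default when it cannot be parsed
    try:
        return float(value)
    except ValueError:
        return default

def parse_with_default(line, start=3, stop=24, default=0.0):
    """Return (label, features) with exactly (stop - start) features."""
    fields = line.split(",")
    features = [to_float_or_default(f, default) for f in fields[start:stop]]
    # Assumption: rows with too few columns are padded with the default
    features += [default] * ((stop - start) - len(features))
    label = 0.0 if fields[0] == '0.0' else 1.0
    return label, features

# Assumed usage with the question's raw_data RDD (not executed here):
# training_data = raw_data.map(parse_with_default) \
#                         .map(lambda p: LabeledPoint(p[0], p[1]))
```

Keep in mind that replacing unparseable values with 0 treats them as real measurements, which may distort the fitted model if 0 is not a neutral value for that feature.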