Spark Logistic Regression Error Dimension Mismatch

Question

I just started using Spark and am trying to run a logistic regression. I keep getting this error:

Caused by: java.lang.IllegalArgumentException: requirement failed: 
Dimensions mismatch when adding new sample. Expecting 21 but got 17.

I have 21 features, but I'm not sure what the 17 means here or how to fix it. My code is below:

from pyspark.mllib.regression import LabeledPoint
from numpy import array

def isfloat(string):
    try:
        float(string)
        return True
    except ValueError:
        return False

def parse_interaction(line):
    line_split = line.split(",")
    # leave_out = [1,2,3]
    clean_line_split = line_split[3:24]
    retention = 1.0
    if line_split[0] == '0.0':
        retention = 0.0
    return LabeledPoint(retention, array([map(float,i) for i in clean_line_split if isfloat(i)]))

training_data = raw_data.map(parse_interaction)

from pyspark.mllib.classification import LogisticRegressionWithLBFGS
from time import time

t0 = time()
logit_model = LogisticRegressionWithLBFGS.train(training_data)
tt = time() - t0

print "Classifier trained in {} seconds".format(round(tt,3))

Since you filter out values when creating the array, its length can be anywhere between 0 and the expected size. It would make more sense to drop malformed entries altogether. — zero323
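One way to read the suggestion above is to reject a whole row when any of its feature fields is not numeric, rather than silently shortening the vector. The following is only a sketch under assumptions: the names parse_row and EXPECTED_FEATURES are made up for illustration, the column layout (label in column 0, features in columns 3 to 23) is taken from the question's code, and the Spark calls are shown as comments so that the helper itself runs without a cluster.

```python
EXPECTED_FEATURES = 21  # assumption: matches the question's slice [3:24]


def isfloat(value):
    """Return True if the string can be converted to a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def parse_row(line):
    """Return (label, features), or None when the row is malformed.

    A row counts as malformed if it has fewer than the expected number
    of feature fields or if any feature field is not numeric.
    """
    fields = line.split(",")
    raw_features = fields[3:24]
    if len(raw_features) != EXPECTED_FEATURES:
        return None
    if not all(isfloat(v) for v in raw_features):
        return None
    label = 0.0 if fields[0] == '0.0' else 1.0
    return label, [float(v) for v in raw_features]


# Possible use inside the Spark job (raw_data is the RDD from the question):
#
#   parsed = raw_data.map(parse_row).filter(lambda r: r is not None)
#   training_data = parsed.map(lambda r: LabeledPoint(r[0], r[1]))
```

With this approach every surviving LabeledPoint has exactly 21 features, at the cost of discarding rows; whether dropping rows is acceptable depends on how many are malformed.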

Hokam Hokam · Accepted Answer · 2016-09-20T03:09:53

It looks like a problem with the raw data. I suspect some of the values are failing the isfloat check, so they are filtered out and the feature vector ends up shorter than the expected 21 (17 in your error). Try printing the values to the console; that will help you identify the offending lines.
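To locate the offending lines, a small helper along these lines could report which fields of a row fail the float conversion. This is a rough sketch, not code from the question: bad_fields is a hypothetical name, and the default column range 3 to 23 is assumed from the slice [3:24] in the question.

```python
def bad_fields(line, start=3, stop=24):
    """Return (column_index, value) pairs for fields that are not floats.

    Only the columns in [start, stop) are checked, which by default is
    the feature range assumed from the question.
    """
    fields = line.split(",")[start:stop]
    bad = []
    for offset, value in enumerate(fields):
        try:
            float(value)
        except ValueError:
            bad.append((start + offset, value))
    return bad


# Possible use on the RDD from the question, looking at a few bad rows:
#
#   problems = raw_data.map(bad_fields).filter(lambda b: len(b) > 0)
#   for p in problems.take(10):
#       print(p)
```

Rows that are simply too short will not show up here, since a missing field cannot fail conversion; checking len(line.split(",")) separately would cover that case.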
