Logistic Regression PySpark MLlib issue with multiple labels

Question

I am trying to create a LogisticRegression model (LogisticRegressionWithSGD), but it fails with the error:

org.apache.spark.SparkException: Input validation failed.

If I give it binary labels (0 and 1 instead of 0, 1 and 2), it succeeds.

Example input:

 parsed_data = [LabeledPoint(0.0, [4.6,3.6,1.0,0.2]),
                LabeledPoint(0.0, [5.7,4.4,1.5,0.4]),
                LabeledPoint(1.0, [6.7,3.1,4.4,1.4]),
                LabeledPoint(0.0, [4.8,3.4,1.6,0.2]),
                LabeledPoint(2.0, [4.4,3.2,1.3,0.2])]

Code:

 model = LogisticRegressionWithSGD.train(parsed_data)

Is the Logistic Regression model in spark supposed to be for binary classification only?

desertnaut · Accepted Answer · 2015-11-06T08:13:18

Although it is not clear from the documentation (you have to dig into the source code to realize it), LogisticRegressionWithSGD works only with binary labels (0 and 1); for multinomial regression, you should use LogisticRegressionWithLBFGS and pass the number of classes via numClasses:

 from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionWithSGD
 from pyspark.mllib.regression import LabeledPoint

 # Three distinct labels: 0.0, 1.0 and 2.0
 parsed_data = [LabeledPoint(0.0, [4.6,3.6,1.0,0.2]),
                LabeledPoint(0.0, [5.7,4.4,1.5,0.4]),
                LabeledPoint(1.0, [6.7,3.1,4.4,1.4]),
                LabeledPoint(0.0, [4.8,3.4,1.6,0.2]),
                LabeledPoint(2.0, [4.4,3.2,1.3,0.2])]

 # sc is an existing SparkContext (e.g. the one provided by the pyspark shell)
 rdd = sc.parallelize(parsed_data)

 model = LogisticRegressionWithSGD.train(rdd)  # gives error:
 # org.apache.spark.SparkException: Input validation failed.

 model = LogisticRegressionWithLBFGS.train(rdd, numClasses=3)  # works OK
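
The numClasses=3 above is hard-coded. As a rough sketch, the value could instead be derived from the labels, assuming they are 0-based class indices (0, 1, ..., k-1), which is what MLlib expects for multiclass input. The helper name infer_num_classes is hypothetical and not part of PySpark; it is plain Python so that it can be checked without a Spark installation.

```python
def infer_num_classes(labels):
    """Return the class count implied by 0-based integer labels.

    Hypothetical helper, not part of PySpark. Assumes the labels are class
    indices 0, 1, ..., k-1 (possibly stored as floats, as in LabeledPoint).
    """
    labels = list(labels)
    if not labels:
        raise ValueError("no labels given")
    for value in labels:
        # Reject negative or fractional labels such as -1.0 or 1.5.
        if value < 0 or float(value) != int(value):
            raise ValueError("label %r is not a non-negative integer" % (value,))
    # The largest index plus one gives the number of classes.
    return int(max(labels)) + 1


labels = [0.0, 0.0, 1.0, 0.0, 2.0]  # labels from the example data above
num_classes = infer_num_classes(labels)
print(num_classes)  # prints 3
```

With an RDD of LabeledPoint objects, the distinct labels could presumably be gathered with something like rdd.map(lambda p: p.label).distinct().collect() and the result passed as numClasses; treat that as an untested suggestion rather than a verified recipe.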
