Scala Multiclass classification with labeled point

Question

I have a multiclass classification problem that I want to solve with logistic regression. I know it can also be tackled with decision trees and random forests, but I want to stick specifically with "LogisticRegressionWithLBFGS". The data cleaning is done: my data is in a DataFrame with a label field (String), a feature vector (a vector of numeric features) and a third column, "LabelIndex" (numbers representing the class).

When I do a train/test split on the DataFrame and try to fit it with LogisticRegressionWithLBFGS:

val model = new LogisticRegressionWithLBFGS().setNumClasses(10).setIntercept(true).setValidateData(true).run("trainingData")

It doesn't like the "run" part.

The example I am working from loads a data file via:

val data = MLUtils.loadLibSVMFile(Spark.sparkContext, "data/mnist.bz2")

(I'm trying to copy the example and slot in my own data, but my data is in a different format.) From some reading, I gather that I need to convert my DataFrame to an RDD[LabeledPoint] by mapping over it.

I'm having trouble finding good information on how to do this.

How do I convert a DataFrame with the 3 fields described above, "Label" (String), "Features" (feature vector) and "IndexedLabel" (Double), into an RDD[LabeledPoint]?

^^ please update the question with the appropriate tag - apache-spark-ml or apache-spark-mllib. — desertnaut
I am using: import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS — JetS79
I found this and got the DataFrame into LabeledPoint format: stackoverflow.com/questions/45882444/… — JetS79

JetS79 · Accepted Answer · 2021-03-08T11:48:53

Got it working:

Can't convert Dataframe to Labeled Point

This link showed me how to make the conversion successfully.
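For anyone who cannot follow the link, here is a minimal sketch of that kind of conversion. It is not necessarily the linked answer's exact code. It assumes the column names "IndexedLabel" and "Features" from the question, that "Features" holds org.apache.spark.ml.linalg vectors (for example from VectorAssembler), and Spark 2.0 or later for Vectors.fromML. The helper names toLabeledPoints and trainMulticlass are made up for illustration.

```scala
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.classification.{LogisticRegressionModel, LogisticRegressionWithLBFGS}
import org.apache.spark.mllib.linalg.{Vectors => MLlibVectors}
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// Hypothetical helper: keep only the numeric label and the feature vector,
// then map each Row to an mllib LabeledPoint.
// Assumes "IndexedLabel" is a Double column and "Features" is an ml Vector column.
def toLabeledPoints(df: DataFrame): RDD[LabeledPoint] =
  df.select("IndexedLabel", "Features").rdd.map { row =>
    val label = row.getAs[Double]("IndexedLabel")
    // The RDD-based API expects mllib vectors, so convert from the ml type.
    val features = MLlibVectors.fromML(row.getAs[MLVector]("Features"))
    LabeledPoint(label, features)
  }

// Hypothetical helper: same settings as in the question, but run() receives
// an RDD[LabeledPoint] instead of a DataFrame or a String.
def trainMulticlass(training: RDD[LabeledPoint], numClasses: Int): LogisticRegressionModel =
  new LogisticRegressionWithLBFGS()
    .setNumClasses(numClasses)
    .setIntercept(true)
    .setValidateData(true)
    .run(training)
```

Usage would look something like trainMulticlass(toLabeledPoints(trainingData), 10), where trainingData is the training half of the split. The labels need to be in the range 0 to numClasses - 1, which is what an indexed label column normally gives you.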
