
I am using the multinomial NaiveBayes classifier in Apache Spark ML (version 2.1.0) to predict text categories.

The problem: how do I convert the predicted label index (0.0, 1.0, 2.0) back to its string category without access to the training DataFrame?

I know IndexToString can be used, but it only seems helpful when training and prediction happen in the same job. In my case they are independent jobs.

The code is organized as follows:
1) TrainingModel.scala: trains the model and saves it to a file.
2) CategoryPrediction.scala: loads the trained model from the file and runs predictions on the test data.

Please suggest a solution.

TrainingModel.scala

val trainData: Dataset[LabeledRecord] = spark.read.option("inferSchema", "false")
  .schema(schema).csv("trainingdata1.csv").as[LabeledRecord]

val labelIndexer = new StringIndexer().setInputCol("category").setOutputCol("label").fit(trainData).setHandleInvalid("skip")

val tokenizer = new RegexTokenizer().setInputCol("text").setOutputCol("words")

val hashingTF = new HashingTF()
  .setInputCol("words")
  .setOutputCol("features")
  .setNumFeatures(1000)

val rf = new NaiveBayes().setLabelCol("label").setFeaturesCol("features").setModelType("multinomial")

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, labelIndexer, rf))

val model = pipeline.fit(trainData)

model.write.overwrite().save("naivebayesmodel")

CategoryPrediction.scala

val testData: Dataset[PredictLabeledRecord] = spark.read.option("inferSchema", "false")
        .schema(predictSchema).csv("testingdata.csv").as[PredictLabeledRecord]

val model = PipelineModel.load("naivebayesmodel")

val predictions = model.transform(testData)

//      val labelConverter = new IndexToString()
//      .setInputCol("prediction")
//      .setOutputCol("predictedLabelString")
//      .setLabels(trainDataFrameIndexer.labels)    

predictions.select("prediction", "text").show(false)

trainingdata1.csv

category,text
Drama,"a b c d e spark"
Action,"b d"
Horror,"spark f g h"
Thriller,"hadoop mapreduce"

testingdata.csv

text
"a b c d e spark"
"spark f g h"

1 Answer


Add a converter to your pipeline that translates the predicted indices back to your string labels, something like this:

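// Use an output column name that is not already in the input data:
// IndexToString rejects an existing output column, and trainData already has "category".
val categoryConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedCategory")
  .setLabels(labelIndexer.labels)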

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, labelIndexer, rf, categoryConverter))

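This takes each prediction and converts it back to a string label using the labels stored in your labelIndexer. Because the converter is saved together with the rest of the pipeline model, the prediction job does not need the training DataFrame.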
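
If the model has already been saved without such a converter, the labels can still be recovered in the prediction job, because the fitted StringIndexerModel is itself a stage of the saved pipeline. Below is a minimal sketch, assuming the pipeline from the question was saved at "naivebayesmodel". The `predictedCategory` column name, the `header` read option and the local master are illustrative assumptions, not part of the original code:

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.feature.{IndexToString, StringIndexerModel}
import org.apache.spark.sql.SparkSession

object CategoryPrediction {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CategoryPrediction")
      .master("local[*]") // assumption: local run, adjust for your cluster
      .getOrCreate()

    // Assumption: testingdata.csv has a header row and a single "text" column.
    val testData = spark.read.option("header", "true").csv("testingdata.csv")

    val model = PipelineModel.load("naivebayesmodel")

    // The fitted StringIndexerModel was saved as a pipeline stage, labels included.
    val labels: Array[String] = model.stages
      .collectFirst { case indexer: StringIndexerModel => indexer.labels }
      .getOrElse(sys.error("No StringIndexerModel stage found in the saved pipeline"))

    // Map each predicted index back to the label at that position.
    val labelConverter = new IndexToString()
      .setInputCol("prediction")
      .setOutputCol("predictedCategory") // hypothetical column name
      .setLabels(labels)

    labelConverter.transform(model.transform(testData))
      .select("text", "prediction", "predictedCategory")
      .show(false)

    spark.stop()
  }
}
```

This variant is only needed when the saved pipeline has no IndexToString stage. If the converter is already part of the pipeline, `model.transform(testData)` produces the string column directly, and adding a second converter with the same output column would fail.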