6
votes

How can we get model metrics when training a random forest binary classifier model in Spark Mllib (F score, AUROC, AUPRC etc.)?

The issue is that BinaryClassificationMetrics takes probabilities while the predict method of a RandomForest classifier returns discrete values 0 or 1.

See: https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#binary-classification

A RandomForest.trainClassifier does not have any clearThreshold method which would make it return probabilities instead of discrete 0 or 1 labels.

1
@eliasah Not actually a duplicate question but the answer there provides a/the solution for the question. I already added that in the answer before you commented.Răzvan Flavius Panda
It's ok. No problem ! Thus the use of the word "possible"eliasah
@eliasah That question is not actually duplicate as it does not ask about metrics. The answer there does point though to the new ml API which helps find a solution. See the updated answer with apache docs example tweaked to fit this question.Răzvan Flavius Panda

1 Answers

9
votes

We need to use the new ml DataFrames based API to get the probabilities instead of the RDD based mllib API.

Update

Following is updated example from Spark documentation to use a BinaryClassificationEvaluator and display the metrics: Area Under Receiver Operating Characteristic (AUROC) and Area Under Precision Recall Curve (AUPRC).

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

// Load and parse the data file, converting it to a DataFrame.
val data = sqlContext.read.format("libsvm").load("D:/Sources/spark/data/mllib/sample_libsvm_data.txt")

// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)

// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a RandomForest model.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Chain indexers and forest in a Pipeline
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

// Train model.  This also runs the indexers.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions
  .select("indexedLabel", "rawPrediction", "prediction")
  .show()

val binaryClassificationEvaluator = new BinaryClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setRawPredictionCol("rawPrediction")

def printlnMetric(metricName: String): Unit = {
  println(metricName + " = " + binaryClassificationEvaluator.setMetricName(metricName).evaluate(predictions))
}

printlnMetric("areaUnderROC")
printlnMetric("areaUnderPR")