Spark random forest binary classifier metrics

Question

How can we get model metrics when training a random forest binary classifier model in Spark Mllib (F score, AUROC, AUPRC etc.)?

The issue is that BinaryClassificationMetrics takes probabilities while the predict method of a RandomForest classifier returns discrete values 0 or 1.

See: https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#binary-classification

A RandomForest.trainClassifier does not have any clearThreshold method which would make it return probabilities instead of discrete 0 or 1 labels.

Possible duplicate of Spark 1.5.1, MLLib random forest probability — eliasah
@eliasah Not actually a duplicate question but the answer there provides a/the solution for the question. I already added that in the answer before you commented. — Răzvan Flavius Panda
@eliasah That question is not actually duplicate as it does not ask about metrics. The answer there does point though to the new ml API which helps find a solution. See the updated answer with apache docs example tweaked to fit this question. — Răzvan Flavius Panda

Răzvan Flavius Panda Răzvan Flavius Panda · Accepted Answer · 2016-06-01T10:58:14

We need to use the new ml DataFrames based API to get the probabilities instead of the RDD based mllib API.

Update

Following is updated example from Spark documentation to use a BinaryClassificationEvaluator and display the metrics: Area Under Receiver Operating Characteristic (AUROC) and Area Under Precision Recall Curve (AUPRC).

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

// Load and parse the data file, converting it to a DataFrame.
val data = sqlContext.read.format("libsvm").load("D:/Sources/spark/data/mllib/sample_libsvm_data.txt")

// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)

// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a RandomForest model.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Chain indexers and forest in a Pipeline
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

// Train model.  This also runs the indexers.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions
  .select("indexedLabel", "rawPrediction", "prediction")
  .show()

val binaryClassificationEvaluator = new BinaryClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setRawPredictionCol("rawPrediction")

def printlnMetric(metricName: String): Unit = {
  println(metricName + " = " + binaryClassificationEvaluator.setMetricName(metricName).evaluate(predictions))
}

printlnMetric("areaUnderROC")
printlnMetric("areaUnderPR")

Spark random forest binary classifier metrics

1 Answers