How to compute binary classifier evaluation metrics per group (in Scala)?

Question

I have a DataFrame that stores the scores and labels for several binary classification problems. For example:

| problem | score | label |
|:--------|:------|-------|
| a       | 0.8   | true  |  
| a       | 0.7   | true  |  
| a       | 0.2   | false |  
| b       | 0.9   | false |  
| b       | 0.3   | true  |  
| b       | 0.1   | false |  
| ...     | ...   | ...   |

My goal is to get binary evaluation metrics (areaUnderROC, for example; see https://spark.apache.org/docs/2.2.0/mllib-evaluation-metrics.html#binary-classification) for each problem, with the end result being something like:

| problem | areaUnderROC |
|:--------|:-------------|
| a       | 0.83         |
| b       | 0.68         |
| ...     | ...          |

I thought about doing something like:

df.groupBy("problem").agg(getMetrics)

but then I am not sure how to write getMetrics in terms of Aggregators (see https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html). Any suggestions?

Steven Black · Accepted Answer · 2018-03-23T18:25:12

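There's a class built just for binary metrics, BinaryClassificationMetrics in pyspark.mllib.evaluation (see the Python docs). The same class is available in Scala as org.apache.spark.mllib.evaluation.BinaryClassificationMetrics.

This code should work: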

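from pyspark.mllib.evaluation import BinaryClassificationMetrics

# BinaryClassificationMetrics takes an RDD of (score, label) pairs as floats,
# so convert the selected DataFrame columns first (True -> 1.0, False -> 0.0).
def to_pairs(rows):
    return rows.rdd.map(lambda r: (float(r["score"]), float(r["label"])))

score_and_labels_a = to_pairs(df.filter("problem = 'a'").select("score", "label"))
metrics_a = BinaryClassificationMetrics(score_and_labels_a)
print(metrics_a.areaUnderROC)
print(metrics_a.areaUnderPR)

score_and_labels_b = to_pairs(df.filter("problem = 'b'").select("score", "label"))
metrics_b = BinaryClassificationMetrics(score_and_labels_b)
print(metrics_b.areaUnderROC)
print(metrics_b.areaUnderPR)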

... and so on for the other problems

This seems to me to be the easiest way :)
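Filtering one problem at a time repeats the same work for every group. As a rough sketch of what a single grouped pass would compute, the plain-Python code below groups (problem, score, label) rows and computes area under ROC using the pairwise-ranking interpretation of AUC (the fraction of positive/negative pairs in which the positive has the higher score, with ties counted as half). The names auc and auc_per_group are hypothetical helpers, not part of Spark, and the pairwise loop assumes each group is small enough to hold in memory.

```python
from collections import defaultdict


def auc(pairs):
    """Area under the ROC curve for a list of (score, label) pairs.

    Counts, over all positive/negative pairs, how often the positive
    has the higher score (ties count as half). Returns None when the
    group has no positives or no negatives, where AUC is undefined.
    """
    pos = [s for s, y in pairs if y]
    neg = [s for s, y in pairs if not y]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_per_group(rows):
    """Group (problem, score, label) rows by problem and score each group."""
    groups = defaultdict(list)
    for problem, score, label in rows:
        groups[problem].append((score, label))
    return {problem: auc(pairs) for problem, pairs in groups.items()}


# Only the six rows visible in the question's table.
rows = [
    ("a", 0.8, True), ("a", 0.7, True), ("a", 0.2, False),
    ("b", 0.9, False), ("b", 0.3, True), ("b", 0.1, False),
]
result = auc_per_group(rows)
print(result)  # {'a': 1.0, 'b': 0.5}
```

On these six rows the sketch gives 1.0 for problem a and 0.5 for problem b. The 0.83 and 0.68 in the question are illustrative values over rows that are not shown, so they are not expected to match.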
