I am trying to use xgboost4j with Spark 2.0.1 and the Dataset API. So far I have obtained predictions in the following format by using model.transform(testData):
predictions.printSchema
root
|-- label: double (nullable = true)
|-- features: vector (nullable = true)
|-- probabilities: vector (nullable = true)
|-- prediction: double (nullable = true)
+-----+--------------------+--------------------+----------+
|label| features| probabilities|prediction|
+-----+--------------------+--------------------+----------+
| 0.0|[0.0,1.0,0.0,476....|[0.96766251325607...| 0.0|
| 0.0|[0.0,1.0,0.0,642....|[0.99599152803421...| 0.0|
But now I would like to generate evaluation metrics. How can I map the predictions to the right format? The question "XGBoost-4j by DMLC on Spark-1.6.1" poses a similar problem, but I could not get its solution to work for me.
val metrics = new BinaryClassificationMetrics(predictions.select("prediction", "label").rdd)
requires an RDD[(Double, Double)], whereas predictions.select("prediction", "label") returns a DataFrame with the schema
root
|-- label: double (nullable = true)
|-- prediction: double (nullable = true)
Trying to map it to the required tuple like
predictions.select("prediction", "label").map{case Row(_) => (_,_)}
fails as well.
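For reference, the shape I would expect the conversion to take (untested against xgboost4j's actual output; it assumes both columns really are plain Doubles, as the schema above suggests) is a Row pattern match over the underlying RDD:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Assumption: "prediction" and "label" are both Double columns.
// .rdd drops down from the Dataset API to RDD[Row], which the
// mllib metrics classes expect.
val predictionAndLabel = predictions
  .select("prediction", "label")
  .rdd
  .map { case Row(prediction: Double, label: Double) => (prediction, label) }

val metrics = new BinaryClassificationMetrics(predictionAndLabel)
println(metrics.areaUnderROC())
```

If the pattern match throws a MatchError at runtime, that would indicate the columns are not the Doubles the printed schema implies.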
Edit:
Reading a bit more in Spark's documentation, I found http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.BinaryClassificationEvaluator, which supports ml instead of mllib, i.e. Datasets. So far I could not successfully integrate xgboost4j into a Pipeline.
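A minimal sketch of what I expect the ml-style evaluation to look like, without a Pipeline (assuming the probabilities column from model.transform can serve as the raw prediction column; the column names are taken from the schema printed above):

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Assumption: the vector in "probabilities" is acceptable as a
// rawPrediction input; "label" and "probabilities" match the
// schema produced by model.transform(testData).
val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("probabilities")
  .setMetricName("areaUnderROC")

val auc = evaluator.evaluate(predictions)
println(s"AUC: $auc")
```

This operates directly on the Dataset returned by transform, so no RDD conversion would be needed if it works.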