
I am trying to use xgboost4j with Spark 2.0.1 and the Dataset API. So far I have obtained predictions in the following format by using model.transform(testData):

predictions.printSchema
root
 |-- label: double (nullable = true)
 |-- features: vector (nullable = true)
 |-- probabilities: vector (nullable = true)
 |-- prediction: double (nullable = true)


+-----+--------------------+--------------------+----------+
|label|            features|       probabilities|prediction|
+-----+--------------------+--------------------+----------+
|  0.0|[0.0,1.0,0.0,476....|[0.96766251325607...|       0.0|
|  0.0|[0.0,1.0,0.0,642....|[0.99599152803421...|       0.0|

But now I would like to generate evaluation metrics. How can I map the predictions to the right format? The question "XGBoost-4j by DMLC on Spark-1.6.1" describes a similar problem, but I could not get that approach to work for me.

val metrics = new BinaryClassificationMetrics(predictions.select("prediction", "label").rdd)

fails because the constructor requires an RDD[(Double, Double)], whereas predictions.select("prediction", "label") yields a DataFrame whose schema is

root
 |-- label: double (nullable = true)
 |-- prediction: double (nullable = true)

Trying to map it to the required tuple, as in

predictions.select("prediction", "label").map{case Row(_) => (_,_)}

fails as well: the wildcard pattern Row(_) matches only a single unnamed field and binds nothing that could populate the tuple.
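For the record, binding both columns to names in the pattern does compile. Here is a minimal, self-contained sketch, with a small hand-made DataFrame standing in for the real predictions (the values are illustrative only):

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.sql.{Row, SparkSession}

val spark = SparkSession.builder().master("local[*]").appName("metrics-sketch").getOrCreate()
import spark.implicits._

// Hand-made stand-in for the real predictions DataFrame
val predictions = Seq((0.0, 0.0), (1.0, 1.0), (1.0, 0.0)).toDF("prediction", "label")

// Bind each column to a name so the tuple has concrete values
val scoreAndLabels = predictions.select("prediction", "label").rdd.map {
  case Row(prediction: Double, label: Double) => (prediction, label)
}
val pairs = scoreAndLabels.collect().toSet

val metrics = new BinaryClassificationMetrics(scoreAndLabels)
val auc = metrics.areaUnderROC()
spark.stop()
```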

Edit

Reading a bit more of Spark's documentation, I found http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.ml.evaluation.BinaryClassificationEvaluator, which supports the ml package (i.e. Datasets) instead of mllib. So far I have not been able to integrate xgboost4j into a pipeline.
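For what it's worth, BinaryClassificationEvaluator operates directly on a DataFrame, so no RDD conversion is needed. A minimal sketch with a hand-made two-row DataFrame (the vector probabilities column is accepted as the raw-prediction column):

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("evaluator-sketch").getOrCreate()
import spark.implicits._

// Hand-made stand-in for (label, probabilities) as produced by model.transform
val predictions = Seq(
  (0.0, Vectors.dense(0.9, 0.1)),
  (1.0, Vectors.dense(0.2, 0.8))
).toDF("label", "probabilities")

val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("probabilities") // a 2-element vector column is accepted
  .setMetricName("areaUnderROC")

val auc = evaluator.evaluate(predictions)
spark.stop()
```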

1 Answer


Here is a good example of how to use xgboost4j in a Spark pipeline: https://github.com/dmlc/xgboost/blob/master/jvm-packages/xgboost4j-example/src/main/scala/ml/dmlc/xgboost4j/scala/example/spark/SparkModelTuningTool.scala. In fact, it uses an XGBoostEstimator, which plays well in a pipeline.
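Sketching the idea from that example: the following assumes the XGBoostEstimator class shipped with xgboost4j-spark at the time, that trainingData and testData are DataFrames with label/features columns, and that the parameter keys follow XGBoost's usual names (all of which may differ between versions, so treat this as a rough outline, not a verified implementation):

```scala
import ml.dmlc.xgboost4j.scala.spark.XGBoostEstimator // from xgboost4j-spark
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

// Hypothetical parameter map; keys follow XGBoost's usual names
val xgbParams = Map(
  "eta" -> 0.1,
  "max_depth" -> 6,
  "objective" -> "binary:logistic"
)

val xgb = new XGBoostEstimator(xgbParams)
  .setLabelCol("label")
  .setFeaturesCol("features")

// The estimator can be a stage in an ordinary ml Pipeline
val pipeline = new Pipeline().setStages(Array(xgb))

// trainingData and testData are assumed DataFrames with label/features columns
val model = pipeline.fit(trainingData)
val predictions = model.transform(testData)

// Evaluate on the Dataset directly, no RDD conversion needed
val auc = new BinaryClassificationEvaluator()
  .setLabelCol("label")
  .setRawPredictionCol("probabilities")
  .evaluate(predictions)
```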