I am trying to use DMLC's XGBoost implementation on Spark 1.6.1. I can train my data with XGBoost, but I am having difficulty with prediction. I want to predict the way it is done in the Apache Spark MLlib libraries, which makes it easy to compute training error, precision, recall, specificity, etc.
I am posting my code below, along with the error I am getting. I used xgboost4j-spark-0.5-jar-with-dependencies.jar when starting spark-shell.
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
import ml.dmlc.xgboost4j.scala.{Booster, DMatrix}
import ml.dmlc.xgboost4j.scala.spark.{DataUtils, XGBoost}
//Load and parse the data file.
val data = sc.textFile("file:///home/partha/credit_approval_2_attr.csv")
val data1 = sc.textFile("file:///home/partha/credit_app_fea.csv")
val parsedData = data.map { line =>
  val parts = line.split(',').map(_.toDouble)
  LabeledPoint(parts(0), Vectors.dense(parts.tail))
}.cache()
val parsedData1 = data1.map { line =>
  val parts = line.split(',').map(_.toDouble)
  Vectors.dense(parts)
}
//Tuning Parameters
val paramMap = List(
  "eta" -> 0.1f,
  "max_depth" -> 5,
  "num_class" -> 2,
  "objective" -> "multi:softmax",
  "colsample_bytree" -> 0.8,
  "alpha" -> 1,
  "subsample" -> 0.5).toMap
//Training the model
val numRound = 20
val model = XGBoost.train(parsedData, paramMap, numRound, nWorkers = 1)
val pred = model.predict(parsedData1)
pred.collect()
Output from pred:
res0: Array[Array[Array[Float]]] = Array(Array(Array(0.0), Array(1.0), Array(1.0), Array(1.0), Array(0.0), Array(0.0), Array(1.0), Array(1.0), Array(0.0), Array(1.0), Array(0.0), Array(0.0), Array(0.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(0.0), Array(1.0), Array(1.0), Array(0.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(0.0), Array(1.0), Array(1.0), Array(1.0), Array(0.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(1.0), Array(0.0), Array(0.0), Array(0.0), Array(0.0), Array(1.0), Array(0.0), Array(0.0), Array(0.0), Array(0.0), Array(0.0), Array(0.0), Array(1.0), Array(1.0), Array(1.0), Array(...
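If I am reading that shape correctly (one outer element per partition, one inner Array[Float] per row, and a single class id per row because of "multi:softmax"), the collected output can at least be flattened into plain class predictions. A minimal sketch on an in-memory array mirroring the first few values of the output above (no Spark involved):

```scala
// Mirrors pred.collect(): outer Array = partitions, inner Array[Float] = one row.
// With "multi:softmax" each row carries a single predicted class id.
val collected: Array[Array[Array[Float]]] =
  Array(Array(Array(0.0f), Array(1.0f), Array(1.0f), Array(0.0f)))

// Flatten the partition batches, then take the single class id per row.
val classIds: Array[Double] = collected.flatten.map(_.head.toDouble)

println(classIds.mkString(", ")) // 0.0, 1.0, 1.0, 0.0
```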
Now, when I use:
val labelAndPreds = parsedData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
Output:
<console>:66: error: overloaded method value predict with alternatives:
(testSet: ml.dmlc.xgboost4j.scala.DMatrix)Array[Array[Float]] <and>
(testSet: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector])org.apache.spark.rdd.RDD[Array[Array[Float]]]
cannot be applied to (org.apache.spark.mllib.linalg.Vector)
val prediction = model.predict(point.features)
^
I then tried this, since predict requires an RDD[Vector]:
val labelAndPreds1 = parsedData.map { point =>
  val prediction = model.predict(Vectors.dense(point.features))
  (point.label, prediction)
}
The outcome was:
<console>:66: error: overloaded method value dense with alternatives:
(values: Array[Double])org.apache.spark.mllib.linalg.Vector <and>
(firstValue: Double,otherValues: Double*)org.apache.spark.mllib.linalg.Vector
cannot be applied to (org.apache.spark.mllib.linalg.Vector)
val prediction = model.predict(Vectors.dense(point.features))
^
Clearly it's an RDD type issue that I am trying to sort out; this is easy with GBTs in Spark (http://spark.apache.org/docs/latest/mllib-ensembles.html#gradient-boosted-trees-gbts).
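For what it's worth, here is the direction I suspect the fix takes: call predict once over the whole RDD[Vector] (the overload the compiler error lists), flatten the per-partition batches, and zip the class ids back with the labels. This is only a sketch and untested; in particular the flatMap/zip step assumes predictions come back in the same order and partitioning as the input, which I have not verified:

```scala
// Sketch: predict over all feature vectors at once, then recombine with labels.
val preds = model
  .predict(parsedData.map(_.features))          // RDD[Array[Array[Float]]], one batch per partition
  .flatMap(batch => batch.map(_.head.toDouble)) // one class id per row
val labelAndPreds = parsedData.map(_.label).zip(preds)
val trainErr = labelAndPreds.filter { case (l, p) => l != p }.count.toDouble / parsedData.count()
println(s"Training Error = $trainErr")
```

Is something along these lines the intended usage of the RDD-based predict, or is there a supported way to score a single Vector?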
Am I trying to do this the right way?
Any help or suggestion would be awesome.