11
votes

I’m having some trouble understanding Spark’s cross validation. Every example I have seen uses it for parameter tuning, but I assumed it could also do plain k-fold cross validation?

What I want to do is perform k-fold cross validation with k = 5, get the accuracy for each fold, and then take the average. In scikit-learn this is how it would be done, where scores gives you the result for each fold and scores.mean() gives the average:

scores = cross_val_score(classifier, x, y, cv=5, scoring='accuracy')

This is how I am doing it in Spark; the ParamGridBuilder is left empty because I don’t want to search over any parameters:

val paramGrid = new ParamGridBuilder().build()

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy") // "accuracy" in Spark 2.x+; Spark 1.x called this metric "precision"


val crossval = new CrossValidator()
  .setEstimator(classifier)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)


val modelCV = crossval.fit(df4)
val chk = modelCV.avgMetrics // one averaged metric per ParamMap in the grid

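For completeness, these are the imports the snippet above relies on, plus a stand-in for classifier (a LogisticRegression here, purely as a placeholder for whatever Estimator is actually used):

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// Placeholder classifier: any Spark ML Estimator that reads a "features"
// vector column and a "label" column and emits "prediction" would work here.
val classifier = new LogisticRegression()
  .setFeaturesCol("features")
  .setLabelCol("label")
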
Is this doing the same thing as the scikit-learn implementation? And why do the examples split the data into training and test sets when doing cross validation?

Related: How to cross validate RandomForest model?

https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/ml/ModelSelectionViaCrossValidationExample.scala


1 Answer

3
votes
  1. What you're doing looks fine.
  2. Basically, yes, it works the same as sklearn's grid-search CV. For each ParamMap in the EstimatorParamMaps (one set of parameters), the algorithm is tested with CV, so avgMetrics holds the cross-validation metric averaged over all folds, one value per ParamMap. With an empty ParamGridBuilder (no parameter search), this reduces to "regular" k-fold cross validation and yields a single cross-validated training accuracy (first sketch below).
  3. Each CV iteration trains on K-1 folds and tests on the remaining fold, so why do most examples split the data into training/test sets before doing cross validation? Because the test folds inside the CV are used for the parameter grid search: they drive model selection, so an additional, untouched dataset is needed to evaluate the final model. That is what the so-called "test dataset" is for (second sketch below). Read more here.
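To make (2) concrete, a minimal sketch reusing the question's crossval, paramGrid and df4 (names assumed from the question):

// An empty ParamGridBuilder().build() still yields one (empty) ParamMap,
// so avgMetrics is a one-element array: the metric averaged over the
// 5 folds -- the analogue of sklearn's scores.mean().
val modelCV = crossval.fit(df4)
println(s"param maps: ${paramGrid.length}")              // 1
println(s"avg 5-fold accuracy: ${modelCV.avgMetrics.head}")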
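And a sketch of the pattern in (3), again assuming the question's df4, crossval and evaluator; the 80/20 split and the seed are arbitrary choices:

// Hold out a test set BEFORE cross validation: the CV's internal test
// folds are consumed by model selection, so only the untouched test set
// gives an unbiased estimate of the final model.
val Array(training, test) = df4.randomSplit(Array(0.8, 0.2), seed = 42L)

val modelCV = crossval.fit(training)   // grid search + CV on training data only
val testAccuracy = evaluator.evaluate(modelCV.transform(test))
println(s"final test accuracy: $testAccuracy")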