I’m having some trouble understanding Spark’s cross validation. Every example I have seen uses it for parameter tuning, but I assumed it could also perform plain k-fold cross validation?
What I want to do is perform k-fold cross validation with k=5. I want the accuracy for each fold and then the average accuracy. In scikit-learn it would be done like this, where scores holds the result for each fold and scores.mean() gives the average:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(classifier, X, y, cv=5, scoring='accuracy')
This is how I am doing it in Spark; the ParamGridBuilder is left empty since I don’t want to tune any parameters.
// empty grid: a single (default) candidate, so CrossValidator only does the k-fold loop
val paramGrid = new ParamGridBuilder().build()

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  .setMetricName("accuracy") // I want accuracy, not precision

val crossval = new CrossValidator()
  .setEstimator(classifier)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(5)

val modelCV = crossval.fit(df4)
val chk = modelCV.avgMetrics // one metric (averaged over folds) per param-grid entry
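To make clearer what I mean by per-fold results, here is a toy sketch of the mechanics I’m after, in plain Scala with no Spark (the data and the threshold “classifier” are made up purely for illustration): assign rows to 5 folds, hold each fold out in turn, fit on the rest, score the held-out fold, then average the 5 accuracies.

```scala
object KFoldSketch {
  // made-up toy data: (feature, label) where label = 1.0 iff feature > 0
  val data: Vector[(Double, Double)] =
    Vector(-3.0, -2.5, -2.0, -1.5, -1.0, 0.5, 1.0, 1.5, 2.0, 2.5)
      .map(x => (x, if (x > 0) 1.0 else 0.0))

  // "training": pick a threshold halfway between the largest class-0 feature
  // and the smallest class-1 feature seen in the training folds
  def fitThreshold(trainRows: Seq[(Double, Double)]): Double = {
    val negMax = trainRows.collect { case (x, y) if y == 0.0 => x }.max
    val posMin = trainRows.collect { case (x, y) if y == 1.0 => x }.min
    (negMax + posMin) / 2.0
  }

  // k-fold loop: round-robin fold assignment, one accuracy per held-out fold
  def foldAccuracies(k: Int): Seq[Double] =
    (0 until k).map { i =>
      val (test, train) = data.zipWithIndex.partition(_._2 % k == i)
      val threshold = fitThreshold(train.map(_._1))
      val correct = test.map(_._1).count { case (x, y) =>
        (if (x > threshold) 1.0 else 0.0) == y
      }
      correct.toDouble / test.size
    }

  def main(args: Array[String]): Unit = {
    val accs = foldAccuracies(5)    // one accuracy per fold (what scores gives in sklearn)
    val mean = accs.sum / accs.size // what I'd like the single average to be
    println(accs.mkString(", "))
    println(mean)
  }
}
```

On this separable toy data every fold scores 1.0, but the point is the shape of the result: a list of per-fold accuracies plus their mean, rather than only the averaged value.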
Is this doing the same thing as the scikit-learn version? And why do the examples split the data into separate training and testing sets when doing cross validation?