Cross-validation is one of those embarrassingly parallel problems.
Say you would like to cross-validate a linear regression model. Assume the design matrix X is n-by-p and the continuous outcome y is an n-by-1 vector. Further assume that foldMatrix is an n-by-k matrix of logicals, where each column represents one partition: a 1 indicates an observation is used for training, a 0 that it is held out for validation. The train-validate split is repeated k times to reduce the variance of the generalization error (GE) estimate.
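For concreteness, here is a minimal sketch of how such a foldMatrix could be built (variable sizes are just examples, and it assumes the Statistics and Machine Learning Toolbox for cvpartition):

```matlab
% Example setup: build an n-by-k logical foldMatrix as described above.
n = 3000; p = 100; k = 10;          % example problem sizes
X = randn(n, p);                    % design matrix
y = randn(n, 1);                    % continuous outcome
c = cvpartition(n, 'KFold', k);     % random k-fold partition
foldMatrix = false(n, k);
for i = 1:k
    foldMatrix(:, i) = training(c, i);  % 1 = training, 0 = validation
end
```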
My (naive?) approach to parallel cross-validation in Matlab would look like:
    matlabpool
    GE = nan(k, 1);
    parfor i = 1:k
        trainIndices = foldMatrix(:, i);
        b = X(trainIndices, :) \ y(trainIndices, :);
        GE(i) = mean((y(~trainIndices, :) - X(~trainIndices, :)*b).^2);
    end
    mspe = mean(GE);
When you run this, Matlab warns that "X is indexed but not sliced in a PARFOR loop. This might result in unnecessary communication overhead" (and likewise for y).
My questions are:
- EDIT: Is there any way to speed up cross-validation using a parallel implementation in Matlab?
- Is there an efficient/elegant way to solve the issue of the X and y variables not being sliced?
Two "solutions" that do not seem very elegant to me are:

1. Ignore the nag. For small problems, say p < 100, n < 3000 and k < 40, a sequential implementation is faster than the parallel one anyway.
2. Pre-allocate the train-validate partitions "explicitly" in a cell array or a 3-dimensional matrix, resulting in k full copies of the data (X and y).
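For what it's worth, the pre-slicing workaround could be sketched like this (untested; the cell-array names are my own). Since parfor slices cell arrays indexed by the loop variable, the warning goes away, but each fold's copy of the data is serialized to the workers:

```matlab
% Sketch of workaround 2: pre-slice the folds into cell arrays so that
% parfor only sees sliced variables, at the cost of duplicating X and y.
XtrainC = cell(k, 1); ytrainC = cell(k, 1);
XvalC   = cell(k, 1); yvalC   = cell(k, 1);
for i = 1:k
    idx = foldMatrix(:, i);
    XtrainC{i} = X(idx, :);   ytrainC{i} = y(idx);
    XvalC{i}   = X(~idx, :);  yvalC{i}   = y(~idx);
end
GE = nan(k, 1);
parfor i = 1:k
    b = XtrainC{i} \ ytrainC{i};                 % fit on training fold
    GE(i) = mean((yvalC{i} - XvalC{i}*b).^2);    % MSPE on validation fold
end
mspe = mean(GE);
```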

