
Cross-validation is one of those embarrassingly parallel problems.

Let's say you would like to cross-validate a linear regression model. Assume that the design matrix X has dimensions n-by-p and the continuous outcome y is an n-by-1 vector. Further assume that foldMatrix is an n-by-k matrix of logicals. Each column represents a partition: where a 1 indicates an observation is used for training and a 0 denotes it is used for validation. This train-validate trick is repeated k times so as to reduce the variance of the generalization error (GE) estimate.
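For concreteness, one way such a fold matrix might be constructed (a sketch; the split ratio of 80/20 is an assumption on my part, and `cvpartition` from the Statistics Toolbox would be the more standard tool):

```matlab
% Sketch: build an n-by-k logical fold matrix for repeated hold-out,
% where each column marks a random 80% of observations as training (true).
n = 1000;
k = 10;
foldMatrix = false(n, k);
for j = 1:k
    idx = randperm(n, round(0.8 * n));  % random 80% training subset
    foldMatrix(idx, j) = true;
end
```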

My (naive?) approach to parallel cross-validation in Matlab would look like:

matlabpool

GE = nan(k,1);  

parfor i = 1:k

   trainIndices = foldMatrix(:, i);
   b = X(trainIndices, :)\y(trainIndices, :);

   GE(i) = mean( (y(~trainIndices, :)  - X(~trainIndices, :)*b).^2 );

end

mspe = mean(GE);

When you run this, Matlab complains that "X is indexed but not sliced in a PARFOR loop. This might result in unnecessary communication overhead" (and likewise for y).

My questions are:

  • EDIT: Is there any way to speed-up cross-validation using a parallel implementation in Matlab?
  • Is there an efficient/elegant way to solve the issue of the X and y variables not being sliced?

Two "solutions" that do not seem very elegant to me are:

  1. Ignore the nag. For small problems, say p < 100, n < 3000 and k < 40, a sequential implementation is faster than the parallel one.

  2. Pre-allocate the train-validate partitions "explicitly" in a cell array or 3-dimensional matrix. This results in k full copies of the data (X and y).
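A sketch of option 2, using cell arrays so that parfor sees properly sliced variables (the variable names `Xtrain`, `Xval`, etc. are mine, not from the question):

```matlab
% Sketch of option 2: pre-slice the data into per-fold cell arrays,
% at the cost of holding k copies of X and y in memory.
Xtrain = cell(k, 1); ytrain = cell(k, 1);
Xval   = cell(k, 1); yval   = cell(k, 1);
for i = 1:k
    tr = foldMatrix(:, i);
    Xtrain{i} = X(tr, :);   ytrain{i} = y(tr, :);
    Xval{i}   = X(~tr, :);  yval{i}   = y(~tr, :);
end

GE = nan(k, 1);
parfor i = 1:k
    b = Xtrain{i} \ ytrain{i};              % train on fold i
    GE(i) = mean( (yval{i} - Xval{i} * b).^2 );  % validate on the rest
end
mspe = mean(GE);
```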


1 Answer


Ignore the nag.

The code analyzer warning is simply there to make sure that you know what you're doing. A lot of parallel problems only do one thing with separate chunks of data, so TMW wants you to know you're reusing some of the data.

Think about it this way: the data has to get to the correct processor somehow. You can either duplicate it in memory, or let the processor request the same piece of memory every time. That duplication takes time though, which we don't want.

Here's a little script we can use to check this:

p = 6;
k = 1000;

N = [30 * (1:9), 300 * (1:9), 3000 * (1:10)];

timing = nan(length(N), 2);

gcp;

for iN = 1:length(N)
    n = N(iN);

    X = rand(n, p);
    beta = rand(p, 1);
    y = X * beta;

    foldMatrix = logical(rand(n,k) > 0.5);

    GE = nan(k,1);

    tic;
    parfor i = 1:k
        trainIndices = foldMatrix(:, i);
        b = X(trainIndices, :)\y(trainIndices, :); %#ok<*PFBNS>

        GE(i) = mean( (y(~trainIndices, :)  - X(~trainIndices, :)*b).^2 );
    end
    timing(iN, 1) = toc;

    tic;
    X_rep = repmat(X, [1 1 k]);
    y_rep = repmat(y, [1 1 k]);
    parfor i = 1:k
        Xi = X_rep(:, :, i);    % sliced: each worker receives only its own copy
        yi = y_rep(:, :, i);
        trainIndices = foldMatrix(:, i);
        b = Xi(trainIndices, :)\yi(trainIndices, :);

        GE(i) = mean( (yi(~trainIndices, :) - Xi(~trainIndices, :)*b).^2 );
    end
    timing(iN, 2) = toc;
end

If you plot timing against N, plot(N, timing):

[plot of timing against N]

The blue line is with the code analyzer warning, the orange is with the repmat. The same thing holds across the k dimension:

[analogous plot of timing against k]

So save yourself the duplication and just ignore the warning. You can add %#ok<PFBNS> to the end of each line where the warning appears, or %#ok<*PFBNS> anywhere in the file to suppress all instances.