3
votes

(correctly and instructively answered, see below)

I'm beginning to experiment with MATLAB and the GPU (NVIDIA GTX 660).

Now, I wrote this simple Monte Carlo algorithm to estimate pi. The following is the CPU version:

function pig = mc1vecnocuda(n)
countr = 0;
A = rand(n,2);               % n random points in the unit square
for i = 1:n
    if norm(A(i,:)) < 1      % point lies inside the unit circle
        countr = countr + 1;
    end
end
pig = (countr/n)*4;
end

This takes very little time to execute on the CPU with 100,000 points "thrown" into the unit circle:

   >> tic; mc1vecnocuda(100000);toc;

      Elapsed time is 0.092473 seconds.

See, instead, what happens with the GPU-ized version of the algorithm:

   function pig = mc1veccuda(n)
   countr = 0;
   gpucountr = gpuArray(countr);
   A = gpuArray.rand(n,2);        % n random points on the GPU
   parfor (i=1:n,1024)
       if norm(A(i,:)) < 1
           gpucountr = gpucountr + 1;
       end
   end
   pig = (gpucountr/n)*4;
   end

Now, this takes a LONG time to be executed:

>> tic; mc1veccuda(100000);toc;
Elapsed time is 21.137954 seconds.

I don't understand why. I used a parfor loop with 1024 workers because, querying my NVIDIA card with gpuDevice, 1024 is the maximum number of simultaneous threads allowed on the GTX 660.

Can someone help me? Thanks.

Edit: this is the updated version that avoids the IF:

function pig = mc2veccuda(n)
countr=0;
gpucountr=gpuArray(countr);
A=gpuArray.rand(n,2);
parfor (i=1:n,1024)

    gpucountr = gpucountr+nnz(norm(A(i,:))<1);

end

pig=(gpucountr/n)*4;
end

And this is the code written following Bichoy's guidelines (the code that achieves the right result):

function pig = mc3veccuda(n)
gpucountr = gpuArray(0);
A = gpuArray.rand(n,2);                   % n random points on the GPU
Asq = A.^2;                               % square each coordinate
Asqsum_big_column = Asq(:,1) + Asq(:,2);  % squared norm of each point
Anorms = Asqsum_big_column.^(1/2);        % norm of each point
gpucountr = gpucountr + nnz(Anorms < 1);  % count points inside the circle
pig = (gpucountr/n)*4;
end
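For readers outside MATLAB, the same fully vectorized estimator can be sketched in NumPy; the structure is identical (one bulk element-wise pass plus one reduction), and the function name here is just illustrative:

```python
import numpy as np

def mc_pi_vectorized(n, seed=0):
    """Vectorized Monte Carlo estimate of pi: draw n points in the
    unit square and count how many fall inside the unit circle."""
    rng = np.random.default_rng(seed)
    A = rng.random((n, 2))                                  # n points in [0,1)^2
    inside = np.count_nonzero(A[:, 0]**2 + A[:, 1]**2 < 1)  # points inside circle
    return inside / n * 4
```

On a GPU this maps to the same shape of work as mc3veccuda: element-wise squaring and addition, then a single count reduction, with no per-point loop.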

Please note the execution time with n = 10 million:

>> tic; mc3veccuda(10000000); toc;
Elapsed time is 0.131348 seconds.
>> tic; mc1vecnocuda(10000000); toc;
Elapsed time is 8.108907 seconds.

I didn't test my original CUDA version (for/parfor), since its execution would take hours with n = 10,000,000.

Great Bichoy! ;)

I'm pretty sure that something is moved from device to host and back again during the execution of the for loop. Is there any way to watch if (and when) a device <--> host transfer occurs? – MadHatter
This is probably not the best way to do it. The whole thing could be vectorized as: pig = nnz(sum(X.^2,2)<=1) * 4 / size(X,1), where X = rand(N,2) or a similar array on the GPU. – Amro
Off-topic, but here is a nice animation of this simulation: pastebin.com/w0N46vJE :) Similar to: en.wikipedia.org/wiki/File:Pi_30K.gif – Amro
Your link is welcome! – MadHatter

3 Answers

3
votes

I guess the problem is with parfor!

parfor is supposed to run on MATLAB workers, that is, on your host, not on the GPU! I guess what is actually happening is that you are starting 1024 threads on your host (not on your GPU) and each of them is trying to call the GPU. This results in the tremendous amount of time your code is taking.

Try to re-write your code to use matrix and array operations, not for-loops! This will show some speed-up. Also, remember that you should have many more calculations to do on the GPU; otherwise, memory transfer will just dominate your code.

Code:

This is the final code after including all corrections and suggestions from several people:

function pig = mc2veccuda(n)
  A = gpuArray.rand(n,2);        % An nx2 random matrix
  Asq = A.^2;                    % Square each element
  Anormsq = Asq(:,1) + Asq(:,2); % Squared norm of each point
  gpucountr = nnz(Anormsq < 1);  % norm < 1 iff squared norm < 1
  pig = (gpucountr/n)*4;
end
1
votes

Many reasons, like:

  1. Movement of data between host and device
  2. The computation within each loop iteration is very small
  3. The call to rand on the GPU may not be parallel
  4. The if condition within the loop can cause divergence
  5. Accumulation into a common variable may run serially, with overhead
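Reason 2 is easy to demonstrate even on the CPU: when each iteration does only a tiny comparison, per-iteration dispatch overhead dominates, while the same work as one bulk operation is cheap. A small Python/NumPy illustration of the contrast (function names are illustrative):

```python
import numpy as np

def count_inside_loop(A):
    # One tiny comparison per iteration: overhead-dominated,
    # analogous to the per-point for/parfor version.
    c = 0
    for i in range(A.shape[0]):
        if A[i, 0]**2 + A[i, 1]**2 < 1:
            c += 1
    return c

def count_inside_vec(A):
    # Identical work expressed as a single bulk operation,
    # analogous to the vectorized GPU version.
    return int(np.count_nonzero(A[:, 0]**2 + A[:, 1]**2 < 1))

A = np.random.default_rng(0).random((100_000, 2))
assert count_inside_loop(A) == count_inside_vec(A)  # same answer, very different cost
```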

It is difficult to profile MATLAB+CUDA code. You should probably try native C++/CUDA and use Parallel Nsight to find the bottleneck.

1
votes

As Bichoy said, CUDA code should always be vectorized. In MATLAB, unless you're writing a CUDA kernel, the only large speedup you get is that the vectorized operations are executed on the GPU, which has thousands of (slow) cores. If you don't have large vectors and vectorized code, it won't help.


Another thing that hasn't been mentioned is that for highly parallel architectures like GPUs you want to use different random number generation algorithms than the "standard" ones. So, to add to Bichoy's answer, passing the generator parameter 'Threefry4x64' (64-bit) or 'Philox4x32-10' (32-bit and a lot faster! Super fast!) can lead to large speedups in CUDA code. MATLAB explains this here: http://www.mathworks.com/help/distcomp/examples/generating-random-numbers-on-a-gpu.html
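The same family of counter-based generators exists outside MATLAB: NumPy ships a Philox bit generator, which is designed so that independent parallel streams are cheap to create. A minimal sketch using it for the same pi estimate (function name is illustrative):

```python
import numpy as np
from numpy.random import Generator, Philox

def mc_pi_philox(n, seed=42):
    """Monte Carlo pi estimate using the counter-based Philox
    generator, a parallel-friendly RNG in the same family as
    MATLAB's 'Philox4x32-10' GPU generator."""
    rng = Generator(Philox(seed))
    A = rng.random((n, 2))
    return np.count_nonzero(A[:, 0]**2 + A[:, 1]**2 < 1) / n * 4
```

Counter-based generators like Philox and Threefry compute the i-th random number directly from a counter, so thousands of GPU threads can each draw from their own stream without sharing state.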