3
votes

(correctly and instructively answered, see below)

I'm beginning to experiment with MATLAB and the GPU (NVIDIA GTX 660).

Now, I wrote this simple Monte Carlo algorithm to estimate pi. The following is the CPU version:

function pig = mc1vecnocuda(n)
countr = 0;
A = rand(n,2);               % n random points in the unit square
for i = 1:n
    if norm(A(i,:)) < 1      % point lies inside the unit circle
        countr = countr + 1;
    end
end
pig = (countr/n)*4;
end

This takes very little time to execute on the CPU with 100,000 points "thrown" into the unit circle:

   >> tic; mc1vecnocuda(100000);toc;

      Elapsed time is 0.092473 seconds.

See, instead, what happens with the GPU-ized version of the algorithm:

   function pig = mc1veccuda(n)
   countr = 0;
   gpucountr = gpuArray(countr);
   A = gpuArray.rand(n,2);        % n random points on the GPU
   parfor (i=1:n,1024)
       if norm(A(i,:)) < 1
           gpucountr = gpucountr + 1;
       end
   end
   pig = (gpucountr/n)*4;
   end

Now, this takes a LONG time to be executed:

>> tic; mc1veccuda(100000);toc;
Elapsed time is 21.137954 seconds.

I don't understand why. I used a parfor loop with 1024 workers because, querying my NVIDIA card with gpuDevice, 1024 is the maximum number of simultaneous threads allowed on the GTX 660.

Can someone help me? Thanks.

Edit: this is the updated version that avoids the IF:

function pig = mc2veccuda(n)
countr=0;
gpucountr=gpuArray(countr);
A=gpuArray.rand(n,2);
parfor (i=1:n,1024)

    gpucountr = gpucountr+nnz(norm(A(i,:))<1);

end

pig=(gpucountr/n)*4;
end

And this is the code written following Bichoy's guidelines (the code that achieves the right result):

function pig = mc3veccuda(n)
gpucountr = gpuArray(0);
A = gpuArray.rand(n,2);                   % n random points on the GPU
Asq = A.^2;                               % square each coordinate
Asqsum_big_column = Asq(:,1) + Asq(:,2);  % squared norm of each point
Anorms = Asqsum_big_column.^(1/2);        % norm of each point
gpucountr = gpucountr + nnz(Anorms < 1);  % count points inside the circle
pig = (gpucountr/n)*4;
end
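For readers outside MATLAB, the same fully vectorized estimator can be sketched in NumPy; the structure is identical (one bulk element-wise pass plus one reduction), and the function name here is just illustrative:

```python
import numpy as np

def mc_pi_vectorized(n, seed=0):
    """Vectorized Monte Carlo estimate of pi: draw n points in the
    unit square and count how many fall inside the unit circle."""
    rng = np.random.default_rng(seed)
    A = rng.random((n, 2))                                  # n points in [0,1)^2
    inside = np.count_nonzero(A[:, 0]**2 + A[:, 1]**2 < 1)  # points inside circle
    return inside / n * 4
```

On a GPU this maps to the same shape of work as mc3veccuda: element-wise squaring and addition, then a single count reduction, with no per-point loop.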

Please note the execution time with n = 10 million:

>> tic; mc3veccuda(10000000); toc;
Elapsed time is 0.131348 seconds.
>> tic; mc1vecnocuda(10000000); toc;
Elapsed time is 8.108907 seconds.

I didn't test my original CUDA version (for/parfor), since its execution would take hours with n = 10,000,000.

Great Bichoy! ;)

I'm pretty sure that something is moved from device to host and back again during the execution of the for loop. Is there any way to watch if (and when) a device <--> host transfer occurs? – MadHatter
This is probably not the best way to do it. The whole thing could be vectorized as: pig = nnz(sum(X.^2,2)<=1) * 4 / size(X,1), where X = rand(N,2) or a similar array on the GPU. – Amro
Off-topic, but here is a nice animation of this simulation: pastebin.com/w0N46vJE :) Similar to: en.wikipedia.org/wiki/File:Pi_30K.gif – Amro
Your link is welcome! – MadHatter

3 Answers

3
votes

I guess the problem is with parfor!

parfor is supposed to run on MATLAB workers, that is, on your host, not on the GPU! I guess what is actually happening is that you are starting 1024 threads on your host (not on your GPU) and each of them is trying to call the GPU. This results in the tremendous amount of time your code is taking.

Try to re-write your code to use matrix and array operations, not for-loops! This will show some speed-up. Also, remember that you should have many more calculations to do on the GPU; otherwise, memory transfer will just dominate your code.

Code:

This is the final code after including all corrections and suggestions from several people:

function pig = mc2veccuda(n)
  A = gpuArray.rand(n,2);        % An nx2 random matrix
  Asq = A.^2;                    % Square each element
  Anormsq = Asq(:,1) + Asq(:,2); % Squared norm of each point
  gpucountr = nnz(Anormsq < 1);  % norm < 1 iff squared norm < 1
  pig = (gpucountr/n)*4;
end
1
votes

Many reasons, like:

  1. Movement of data between host and device
  2. The computation within each loop iteration is very small
  3. The call to rand on the GPU may not be parallel
  4. The if condition within the loop can cause divergence
  5. Accumulation into a common variable may run serially, with overhead
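Reason 2 is easy to demonstrate even on the CPU: when each iteration does only a tiny comparison, per-iteration dispatch overhead dominates, while the same work as one bulk operation is cheap. A small Python/NumPy illustration of the contrast (function names are illustrative):

```python
import numpy as np

def count_inside_loop(A):
    # One tiny comparison per iteration: overhead-dominated,
    # analogous to the per-point for/parfor version.
    c = 0
    for i in range(A.shape[0]):
        if A[i, 0]**2 + A[i, 1]**2 < 1:
            c += 1
    return c

def count_inside_vec(A):
    # Identical work expressed as a single bulk operation,
    # analogous to the vectorized GPU version.
    return int(np.count_nonzero(A[:, 0]**2 + A[:, 1]**2 < 1))

A = np.random.default_rng(0).random((100_000, 2))
assert count_inside_loop(A) == count_inside_vec(A)  # same answer, very different cost
```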

It is difficult to profile MATLAB+CUDA code. You should probably try native C++/CUDA and use Parallel Nsight to find the bottleneck.

1
votes

As Bichoy said, CUDA code should always be vectorized. In MATLAB, unless you're writing a CUDA kernel, the only large speedup you get is that the vectorized operations are executed on the GPU, which has thousands of (slow) cores. If you don't have large vectors and vectorized code, it won't help.


Another thing that hasn't been mentioned is that for highly parallel architectures like GPUs you want to use different random number generation algorithms than the "standard" ones. So, to add to Bichoy's answer, passing the generator parameter 'Threefry4x64' (64-bit) or 'Philox4x32-10' (32-bit and a lot faster! Super fast!) can lead to large speedups in CUDA code. MATLAB explains this here: http://www.mathworks.com/help/distcomp/examples/generating-random-numbers-on-a-gpu.html
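The same family of counter-based generators exists outside MATLAB: NumPy ships a Philox bit generator, which is designed so that independent parallel streams are cheap to create. A minimal sketch using it for the same pi estimate (function name is illustrative):

```python
import numpy as np
from numpy.random import Generator, Philox

def mc_pi_philox(n, seed=42):
    """Monte Carlo pi estimate using the counter-based Philox
    generator, a parallel-friendly RNG in the same family as
    MATLAB's 'Philox4x32-10' GPU generator."""
    rng = Generator(Philox(seed))
    A = rng.random((n, 2))
    return np.count_nonzero(A[:, 0]**2 + A[:, 1]**2 < 1) / n * 4
```

Counter-based generators like Philox and Threefry compute the i-th random number directly from a counter, so thousands of GPU threads can each draw from their own stream without sharing state.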