
I'm using parallel.gpu.CUDAKernel to launch CUDA kernels in MATLAB R2011a. My code is designed so that successive kernel launches inside a loop populate the same gpuArray, with each launch writing only to its own unique segment of the array.

By the end of execution, the entire array should be full. However, when I transfer the memory back to the host with gather(), only the memory written to by the last kernel launch is correct; everything else is blank. This is also true if I break out of the loop somewhere in the middle.

I have verified that this is indeed the case by passing in a flag to indicate the kernel iteration. If it is anything except the first iteration, then the kernel does nothing. However, the data locations written to by the first kernel are still empty, even though subsequent kernels do nothing! This is not the case if I break out of the loop directly after launching the first kernel.

It therefore seems to me that MATLAB is resetting the gpuArray between kernel launches. Is there a way to prevent it from doing so?

The gpuArrays in the Parallel Computing Toolbox are not very functional. You're better off using Jacket. While I am biased because I work on Jacket, I'm not kidding when I say that you shouldn't waste your time with gpuArrays. If you're not going to use Jacket, you're better off sticking to the CPU or writing all your own CUDA code. - arrayfire
It seems like a great product. Unfortunately, being a student, I'm limited to free software and software provided by my university. For now I'm going to try writing a mex interface instead. - Richard
Sounds good. If you send a note to your IT department asking them to purchase Jacket, they might just do so. In fact, they may already have a license for Jacket (most universities have some Jacket licenses by now). - arrayfire

1 Answer


This should work, provided you capture the output of the feval call. Consider a trivial kernel like this:

__global__ void setOneEl( double * array, double val, int element ) {
    array[element] = val;
}

Then, running the following code in MATLAB does what I believe you're after:

>> k = parallel.gpu.CUDAKernel('kern.ptx');
>> g = parallel.gpu.GPUArray.zeros(1,10);
>> for ii = 1:2:10, g = k.feval(g, rand, ii); end
>> gather(g)
ans =
         0    0.0975         0    0.2785         0    0.5469         0    0.9575         0    0.9649

To be consistent with ordinary MATLAB semantics, gpuArray objects are value-based. When you wish to modify a gpuArray instance, you must therefore capture the output value back into the same variable, as you would with any other MATLAB data type. Note, however, that the CUDAKernel.feval call recognizes when you're capturing the result into the same variable and can use in-place optimization to avoid making copies.
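
To make the difference concrete, here is a minimal sketch contrasting the two patterns. It assumes k is the CUDAKernel built from the setOneEl kernel above. The variable name out is hypothetical, and the first loop is only a guess at what the code in the question might be doing, not something confirmed by the question itself.

```matlab
% Assumes k is the CUDAKernel created above from the setOneEl kernel.
g = parallel.gpu.GPUArray.zeros(1,10);

% Pattern 1 (guess at the problem): the result goes into a different
% variable ('out' is a hypothetical name). Because gpuArrays are
% value-based, g itself is never modified, so every launch starts again
% from the original all-zero g.
for ii = 1:2:10
    out = k.feval(g, rand, ii);
end
gather(out)   % should show only the element written by the last launch
gather(g)     % should still be all zeros

% Pattern 2 (from the answer): capture the result back into g, so each
% launch starts from the array produced by the previous one.
for ii = 1:2:10
    g = k.feval(g, rand, ii);
end
gather(g)     % should show the elements written by every launch
```

If the first pattern matches your loop, it would explain why only the last launch's segment survives, and also why breaking out after the first launch appears to work.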