After fixing the code I posted here (adding *sizeof(float) to the shared memory allocation, although it doesn't matter in this case since I allocate shared memory through MATLAB), I ran the code, which successfully returned results of size up to sizeof(float)*18*18*5000*100 bytes.
I took the PTX and used it to run the code through MATLAB (it found the right entry point, the function I wanted to run):
kernel=parallel.gpu.CUDAKernel('Tst.ptx','float *,const float *,int');
mask=gpuArray.randn([7,7,1],'single');
toConv=gpuArray.randn([12,12,5],'single'); %%generate random data for testing
setConstantMemory(kernel,'masks',mask); %%transfer data to constant memory.
kernel.ThreadBlockSize=[(12+2*7)-2 (12+2*7)-2 1];
kernel.GridSize=[1 5 1]; %%first element is how many convolution masks
%%second one is how many matrices we want to convolve
kernel.SharedMemorySize=(24*24*4);
foo=gpuArray.zeros([18 18 5 1],'single'); %%result size
foo=reshape(foo,[numel(foo) 1]);
toConv=reshape(toConv,[numel(toConv) 1]);
foo=feval(kernel,foo,toConv,12);
I get:
Error using parallel.gpu.CUDAKernel/feval
An unexpected error occurred trying to launch a kernel. The CUDA error was:
CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES
Error in tst (line 12)
foo=feval(kernel,foo,toConv,12);
Out of resources for such a small example? It worked for a problem a hundred thousand times larger in Visual Studio...
I have a GTX 480 (compute capability 2.0, about 1.5 GB of memory, 1024 max threads per block, 48 KB of shared memory).
1> ptxas : info : 0 bytes gmem, 25088 bytes cmem[2]
1> ptxas : info : Compiling entry function '_Z6myConvPfPKfi' for 'sm_21'
1> ptxas : info : Function properties for _Z6myConvPfPKfi
1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1> ptxas : info : Used 10 registers, 44 bytes cmem[0]
EDIT: problem resolved by compiling with Configuration Active(Release) and Platform Active(x64).
CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES means you are asking for too many per-thread or per-block resources (registers, local memory, or shared memory). Can you edit your question to include the output of compiling the kernel with -Xptxas="-v" as an option to nvcc, and tell us what GPU you have? Note that MATLAB is compiling your kernel for you from PTX, so it is likely that there is something different between the final code emitted by the two different compilation trajectories. – talonmies
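One way to narrow this down from the MATLAB side is to compare the requested launch configuration with the limits MATLAB reports for the device and for the kernel it built from the PTX. The following is only a rough sketch: it assumes Tst.ptx from the question is on the path and that a CUDA device is selected, and the variable names blockSize, sharedBytes and threadsPerBlock are illustrative. It does not account for any statically declared shared memory inside the kernel.

```matlab
% Sketch: check the launch configuration against reported limits.
% Assumes Tst.ptx (from the question) is on the MATLAB path.
dev    = gpuDevice;   % currently selected GPU
kernel = parallel.gpu.CUDAKernel('Tst.ptx','float *,const float *,int');

blockSize   = [24 24 1];   % (12+2*7)-2 = 24 threads per side, as in the question
sharedBytes = 24*24*4;     % one single-precision value per thread
threadsPerBlock = prod(blockSize);

% kernel.MaxThreadsPerBlock is specific to this compiled kernel (it depends on
% its register usage); dev.MaxThreadsPerBlock is the hardware-wide limit.
fprintf('Threads per block: %d requested, %d allowed for this kernel, %d for the device\n', ...
    threadsPerBlock, kernel.MaxThreadsPerBlock, dev.MaxThreadsPerBlock);
fprintf('Shared memory: %d bytes requested, %d bytes available per block\n', ...
    sharedBytes, dev.MaxShmemPerBlock);

if threadsPerBlock > kernel.MaxThreadsPerBlock
    warning('Block of %d threads exceeds the per-kernel limit of %d.', ...
        threadsPerBlock, kernel.MaxThreadsPerBlock);
end
```

If the per-kernel limit comes out lower than the 576 threads requested here, that would suggest the code MATLAB generated from the PTX needs more registers per thread than the ptxas output above indicates, which would be consistent with the Debug versus Release difference.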