
After fixing the code I posted here (adding `*sizeof(float)` to the shared memory allocation - though it doesn't matter here, since I allocate shared memory through MATLAB), I ran the code in Visual Studio, and it successfully returned results of size up to sizeof(float)*18*18*5000*100 bytes.

I took the PTX and used it to run the code through MATLAB (it found the right entry point - the function I wanted to run):

    kernel=parallel.gpu.CUDAKernel('Tst.ptx','float *,const float *,int');
    mask=gpuArray.randn([7,7,1],'single');
    toConv=gpuArray.randn([12,12,5],'single'); %%generate random data for testing
    setConstantMemory(kernel,'masks',mask);  %%transfer data to constant memory.
    kernel.ThreadBlockSize=[(12+2*7)-2 (12+2*7)-2 1];
    kernel.GridSize=[1 5 1]; %%first element is how many convolution masks
    %%second one is how many matrices we want to convolve
    kernel.SharedMemorySize=(24*24*4);
    foo=gpuArray.zeros([18 18 5 1],'single'); %%result size
    foo=reshape(foo,[numel(foo) 1]);
    toConv=reshape(toConv,[numel(toConv) 1]);
    foo=feval(kernel,foo,toConv,12);

I get:

    Error using parallel.gpu.CUDAKernel/feval
    An unexpected error occurred trying to launch a kernel.
    The CUDA error was: CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES

    Error in tst (line 12)
    foo=feval(kernel,foo,toConv,12);

Out of resources for such a small example? It worked for a problem a hundred thousand times larger in Visual Studio...

I have a GTX 480 (compute capability 2.0, about 1.5 GB memory, 1024 max threads per block, 48 KB shared memory). The `-Xptxas="-v"` output from the Visual Studio build:

    1>  ptxas : info : 0 bytes gmem, 25088 bytes cmem[2]
    1>  ptxas : info : Compiling entry function '_Z6myConvPfPKfi' for 'sm_21'
    1>  ptxas : info : Function properties for _Z6myConvPfPKfi
    1>      0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
    1>  ptxas : info : Used 10 registers, 44 bytes cmem[0]

EDIT: the problem was resolved by compiling with Configuration set to Active(Release) and Platform set to Active(x64).

CUDA_ERROR_LAUNCH_OUT_OF_RESOURCES means you are asking for too many per-thread or per-block resources (registers, local memory or shared memory). Can you edit your question to include the output of compiling the kernel with -Xptxas="-v" as an option to nvcc, and tell us what GPU you have? Note that MATLAB is compiling your kernel for you from PTX; it is likely there is something different between the final code emitted by the two compilation trajectories. - talonmies
Also note that the guide you link to shows how to interrogate the MATLAB kernel structure to see the kernel properties (I would pay careful attention to the value of MaxThreadsPerBlock, for example). - talonmies
Edited as you requested. And I know MATLAB shows me that information - I kept all of it in mind. Note that when I run the code through MATLAB, I use less constant memory than I did through Visual Studio. Shared memory usage remains the same, and still well below the maximum. The error appears after allocation of all variables, and with the small sizes I use there is no way it's out of global memory. - user1999728
This has nothing to do with memory. It is most likely threads per block, probably because of a difference between the PTX you are feeding to MATLAB and the code you compiled to binary inside VS. The default architecture for the CUDA toolchain only supports 512 threads per block. If you have compiled your kernel to PTX 1.x, the code MATLAB tries to run might be limited to 512 threads - and you are trying to run 576. The error you are reporting is consistent with that. - talonmies
Don't edit the solution into your question. Add it as an answer (this is perfectly OK here). Later you will be able to accept your own answer, which shows that the question is answered and gets it off the unanswered-questions list. - talonmies

1 Answer


The problem was resolved by compiling with Configuration set to Active(Release) and Platform set to Active(x64) instead of the defaults. (Due to backwards compatibility, I'm guessing it's not about x64 so much as about compiling for Release rather than Debug.)