1 vote

Is it possible to allocate shared memory in a kernel (statically or via extern) and use it in other device functions called from that kernel? Especially interesting for me: can I use it as a returned parameter/array?

It seems to be no problem to use shared memory as an input parameter to device functions (at least I get no problems, errors or unexpected results).

When I use it as a return parameter, I get several problems:

  • I can run the program when it is built in the debug configuration.

  • But I can't debug it: it crashes in the device functions when I use the shared memory.

  • I also get errors from cuda-memcheck: an invalid __global__ read, because the address is out of bounds and it reads from shared address space.

So is it possible to use shared memory for returning arrays from device functions to kernels?

EDIT:

I wrote a very simple example to rule out other errors on my part.

#define CUDA_CHECK_RETURN(value) {                                      \
    cudaError_t _m_cudaStat = (value);                                  \
    if (_m_cudaStat != cudaSuccess) {                                   \
        printf( "Error %s at line %d in file %s\n",                     \
                cudaGetErrorString(_m_cudaStat), __LINE__, __FILE__);   \
        exit(-1);                                                       \
    } }

__device__ void Function( const int *aInput, volatile int *aOutput )
{
    for( int i = 0; i < 10; i++ )
        aOutput[i] = aInput[i] * aInput[i];
}

__global__ void Kernel( int *aInOut )
{
     __shared__ int aShared[10];

    for(int i=0; i<10; i++)
        aShared[i] = i+1;

    Function( aShared, aInOut );
}

int main( int argc, char** argv )
{
    int *hArray = NULL;
    int *dArray = NULL;

    hArray = ( int* )malloc( 10*sizeof(int) );
    CUDA_CHECK_RETURN( cudaMalloc( (void**)&dArray, 10*sizeof(int) ) );

    for( int i = 0; i < 10; i++ )
            hArray[i] = i+1;

    CUDA_CHECK_RETURN( cudaMemcpy( dArray, hArray, 10*sizeof(int), cudaMemcpyHostToDevice ) );

    Kernel<<<1,1>>>( dArray );

    CUDA_CHECK_RETURN( cudaMemcpy( hArray, dArray, 10*sizeof(int), cudaMemcpyDeviceToHost ) );

    free( hArray );
    CUDA_CHECK_RETURN( cudaFree( dArray ) );

    return 0;
}

I execute the kernel with one thread block and one thread per block. It's no problem to build and run the program, and I get the expected results. But when the program is tested with cuda-memcheck, the kernel is terminated and the following log appears.

Error unspecified launch failure at line 49 in file ../CuTest.cu
========= Invalid __global__ read of size 4
=========     at 0x00000078 in /home/strautz/Develop/Software/CuTest/Debug/../CuTest.cu:14:Function(int const *, int volatile *)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x01000000 is out of bounds
=========     Device Frame:/home/strautz/Develop/Software/CuTest/Debug/../CuTest.cu:25:Kernel(int*) (Kernel(int*) : 0xd0)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/libcuda.so (cuLaunchKernel + 0x34b) [0x55d0b]
=========     Host Frame:/usr/lib/libcudart.so.5.0 [0x8f6a]
=========
========= Program hit error 4 on CUDA API call to cudaMemcpy 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/libcuda.so [0x24e129]
=========     Host Frame:/usr/lib/libcudart.so.5.0 (cudaMemcpy + 0x2bc) [0x3772c]
=========     Host Frame:[0x5400000]
=========
========= ERROR SUMMARY: 2 errors

Does the shared memory have to be aligned? Do I have to do something else, or can this be ignored? I don't think so.
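As a side note (an editor's sketch, not part of the original program): the "unspecified launch failure" reported at the later cudaMemcpy actually originates from the kernel. Launch and execution errors can be surfaced right after the launch, reusing the CUDA_CHECK_RETURN macro defined above:

    Kernel<<<1,1>>>( dArray );
    CUDA_CHECK_RETURN( cudaGetLastError() );      // reports launch/configuration errors
    CUDA_CHECK_RETURN( cudaDeviceSynchronize() ); // reports errors raised during kernel execution

This pins the failure to the kernel itself instead of letting it surface at the next API call.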

2 Comments
Seems your memcpy has a problem too. How do you allocate the device memory and launch the kernel? It would be perfect if you showed a complete .cu file. – kangshiyin
My guess is you are doing no CUDA error checking. – Robert Crovella
I updated my post once more with the whole program. But I'm not at my work PC, and on this PC I don't even have CUDA installed, so the syntax could be wrong (I can't check it at the moment). This is my little test program; I will update the post with the correct syntax tomorrow, when I'm back at my work PC. – hubs
In the test program I don't do error checking, right; I will add it tomorrow. But in the "big" program I do. It may be that I forgot it in some places; I hope not. I'll check this tomorrow, too. – hubs
The whole .cu file of the test is now posted in the original post, but as I said, I still get the error with cuda-memcheck, and since I added CUDA error checking I get an unspecified launch failure at the second cudaMemcpy. Do you have any idea why you can test it without an error and I can't? I'm trying to extract some code from the original program. This is going to be difficult, because it's a big program and I'm not allowed to show all of the code. – hubs

2 Answers

1 vote

See the CUDA 5.0 sample file /usr/local/cuda-5.0/samples/6_Advanced/reduction/doc/reduction.ppt.

sdata is a local variable of the device function warpReduce(). It stores the address of the shared memory, which can be read and written through that address within the device function. The final reduction result is then read from shared memory outside warpReduce().

template <unsigned int blockSize>
__device__ void warpReduce(volatile int *sdata, unsigned int tid) {
    if (blockSize >=  64) sdata[tid] += sdata[tid + 32];
    if (blockSize >=  32) sdata[tid] += sdata[tid + 16];
    if (blockSize >=  16) sdata[tid] += sdata[tid +  8];
    if (blockSize >=   8) sdata[tid] += sdata[tid +  4];
    if (blockSize >=   4) sdata[tid] += sdata[tid +  2];
    if (blockSize >=   2) sdata[tid] += sdata[tid +  1];
}
template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + tid;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;

    while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+blockSize];  i += gridSize;  }
    __syncthreads();

    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid <  64) { sdata[tid] += sdata[tid +  64]; } __syncthreads(); }

    if (tid < 32) warpReduce<blockSize>(sdata, tid);
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
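Note that sdata is declared extern __shared__, so its size is not known at compile time and must be supplied as the third launch parameter. A minimal launch sketch (the thread/block counts and variable names here are assumptions for illustration, not taken from the sample):

    const unsigned int threads = 256;   // must match the blockSize template argument
    const unsigned int blocks  = 64;
    // third <<< >>> parameter: dynamic shared memory bytes per block
    reduce6<threads><<<blocks, threads, threads * sizeof(int)>>>( g_idata, g_odata, n );

If the shared memory size is omitted here, it defaults to zero bytes and every sdata access becomes out of bounds.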
1 vote

As described here, it was just a driver problem. After I updated to the current driver, everything works fine.