I've been stuck on this all day. The following program fails with an "Out of range shared or local address" error. Commenting out this line makes the error go away:

hist[tidx] = 0;

However, I don't think allocating 88*4 bytes of shared memory should be a problem.

Commenting out this line also makes the error go away:

NVMatrix Acts(acts, true);

It seems that if I allocate the Acts matrix in global memory, shared memory behaves abnormally. Any ideas?

int main(int argc, char ** argv)
{
    float * act = new float[2985984];
    for (int i=0; i<2985984; i++)
        act[i] = 0.0001*(i+1);

    Matrix acts(act, 23328, 128);   // use act as the data to initialize the 23328x128 matrix on the CPU

    NVMatrix Acts(acts, true);      // create an Acts matrix in GPU global memory and copy the values from CPU to GPU
                                    // if this line is commented out, the program runs without error

    float cost = Calculate();

}

float Calculate()
{
    dim3 blocks(4,96);
    dim3 threads(32,8);

    cudaFuncSetCacheConfig(createShare<8, 32>, cudaFuncCachePreferShared);

    int numLabels = 88;

    createShare<8, 32><<<blocks, threads, numLabels>>>(numLabels);

    return 0;
}

template <int B_Y, int B_X>
__global__ void createShare(int numLabels)
{
    extern __shared__ float hist[];

    int tidx = threadIdx.y * B_X + threadIdx.x;
    if (tidx<numLabels) {
        printf("block %d %d %d\n", blockIdx.x, blockIdx.y, tidx);
        hist[tidx] = 0;
    }
}

1 Answer

Change this:

createShare<8, 32><<<blocks, threads, numLabels>>>(numLabels);

to this:

createShare<8, 32><<<blocks, threads, numLabels*sizeof(float)>>>(numLabels);

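The third launch parameter, the size of the dynamic shared memory allocation, is given in bytes, not in elements. Passing numLabels requests only 88 bytes, which is room for 22 floats, so every thread with tidx from 22 to 87 writes past the end of hist. You need numLabels*sizeof(float) = 352 bytes to cover 88 float values.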
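For reference, here is a minimal self-contained sketch of the corrected launch. It keeps the kernel and launch dimensions from the question but leaves out the Matrix/NVMatrix code, the device printf, and the cudaFuncSetCacheConfig call, since none of them are needed to show the fix. The sharedBytes variable and the error-checking code are my additions, not part of the original program.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same kernel as in the question: each of the first numLabels threads
// in a block zeroes one slot of the dynamically sized shared array.
template <int B_Y, int B_X>
__global__ void createShare(int numLabels)
{
    extern __shared__ float hist[];

    int tidx = threadIdx.y * B_X + threadIdx.x;
    if (tidx < numLabels) {
        hist[tidx] = 0.0f;
    }
}

int main()
{
    dim3 blocks(4, 96);
    dim3 threads(32, 8);
    int numLabels = 88;

    // The third launch parameter is a byte count, not an element count.
    // sharedBytes is a name introduced here for illustration.
    size_t sharedBytes = numLabels * sizeof(float);   // 88 * 4 = 352 bytes

    createShare<8, 32><<<blocks, threads, sharedBytes>>>(numLabels);

    // Errors in the launch configuration are reported here ...
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        // ... and errors raised while the kernel runs are reported here.
        err = cudaDeviceSynchronize();
    }
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("kernel finished without error\n");
    return 0;
}
```

Checking cudaGetLastError() right after the launch and the return value of cudaDeviceSynchronize() is a reasonable habit here, because out-of-bounds shared memory writes are otherwise only reported by whichever later CUDA call happens to notice them.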