3
votes

I recently discovered the racecheck tool of cuda-memcheck, available in CUDA 5.0 (cuda-memcheck --tool racecheck, see the NVIDIA doc). This tool can detect race conditions with shared memory in a CUDA kernel.

In debug mode, this tool does not detect anything, which is apparently normal. However, in release mode (-O3), I get errors depending on the parameters of the problem.
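For reference, the two build configurations correspond roughly to the following nvcc invocations (-G/-g for a device/host debug build, -O3 for an optimized one; the exact flags I used may have differed slightly):

nvcc -G -g test.cu -o test
nvcc -O3 test.cu -o test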

Here is an error example (initialization of shared memory on line 22, assignment on line 119):

========= ERROR: Potential WAW hazard detected at shared 0x0 in block (35, 0, 0) :
========= Write Thread (32, 0, 0) at 0x00000890 in ....h:119:void kernel_test3(Data*)
========= Write Thread (0, 0, 0) at 0x00000048 in ....h:22:void kernel_test3(Data*)
========= Current Value : 13, Incoming Value : 0

  1. The first thing that surprised me was the thread ids. When I first encountered the error, each block contained 32 threads (ids 0 to 31), so why is there a problem with thread id 32? I even added an extra check on threadIdx.x, but this changed nothing.
  2. I use shared memory as a temporary buffer, and each thread only deals with its own slice of a multidimensional array, e.g. __shared__ float arr[SIZE_1][SIZE_2][NB_THREADS_PER_BLOCK], indexed by threadIdx.x in the last dimension. I do not really understand how there could be any race conditions, since each thread only touches its own part of shared memory.
  3. Reducing the grid size from 64 blocks to 32 blocks seemed to solve the issue (with 32 threads per block). I do not understand why.

In order to understand what was happening, I tested with some simpler kernels. Let me show you an example of a kernel that triggers that kind of error. Basically, this kernel uses SIZE_X*SIZE_Y*NTHREADS*sizeof(float) bytes of shared memory per block, and I can use 48 KB of shared memory per SM.

test.cu

template <unsigned int NTHREADS>
__global__ void kernel_test()
{
    const int SIZE_X = 4;
    const int SIZE_Y = 4;

    __shared__ float tmp[SIZE_X][SIZE_Y][NTHREADS];

    for (unsigned int i = 0; i < SIZE_X; i++)
        for (unsigned int j = 0; j < SIZE_Y; j++)
            tmp[i][j][threadIdx.x] = threadIdx.x;
}

int main()
{
  const unsigned int NTHREADS = 32;

  //kernel_test<NTHREADS><<<32, NTHREADS>>>(); // ---> works fine
  kernel_test<NTHREADS><<<64, NTHREADS>>>();

  cudaDeviceSynchronize(); // ---> gives racecheck errors if NBLOCKS > 32
}

Compilation:

nvcc test.cu --ptxas-options=-v -o test

If we run the kernel:

cuda-memcheck --tool racecheck test

  • kernel_test<32><<<32, 32>>>(); : 32 blocks, 32 threads => does not lead to any apparent racecheck error.
  • kernel_test<32><<<64, 32>>>(); : 64 blocks, 32 threads => leads to WAW hazards (threadIdx.x = 32?!) and errors.

========= ERROR: Potential WAW hazard detected at shared 0x6 in block (57, 0, 0) :
========= Write Thread (0, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Write Thread (1, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Current Value : 0, Incoming Value : 128

========= INFO:(Identical data being written) Potential WAW hazard detected at shared 0x0 in block (47, 0, 0) :
========= Write Thread (32, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Write Thread (0, 0, 0) at 0x00000048 in ....h:403:void kernel_test(void)
========= Current Value : 0, Incoming Value : 0

So what am I missing here? Am I doing something wrong with shared memory? (I am still a beginner with this)

** UPDATE **

The problem seems to be coming from cudaDeviceSynchronize() when NBLOCKS > 32. Why is this happening?

2
Hi BenC, could you post a complete repro case? Also, what GPU are you running this on, and what are the CUDA driver and toolkit versions? – Vyas
Hi Vyas, normally the minimal code I provided is enough to reproduce the problem. What information are you missing? I have a GeForce GT 650M, CUDA 5.0, CC 3.0 and drivers 304.64. – BenC
Hi BenC, could you post the entire host code as well as the device code, and the exact build line you used? It makes it much easier to reproduce the problem. – Vyas
@marina.k: NTHREADS is the number of threads per block; here it was 32. – BenC
@Vyas: I updated my post, and I found that this is actually caused by cudaDeviceSynchronize() for a certain number of blocks. – BenC

2 Answers

2
votes

For starters, cudaDeviceSynchronize() isn't the cause; your kernel is. The kernel launch is asynchronous, which is why the error is only reported at your call to cudaDeviceSynchronize().
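As a side note, here is a minimal sketch of how to separate launch errors from execution errors explicitly; dummy_kernel is just a hypothetical stand-in for your kernel_test, and the printf formatting is only for illustration:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the kernel_test kernel from the question.
__global__ void dummy_kernel() {}

int main()
{
    dummy_kernel<<<64, 32>>>();

    // The launch itself is asynchronous: cudaGetLastError() only reports
    // launch/configuration problems (e.g. an invalid configuration).
    cudaError_t launchErr = cudaGetLastError();

    // Execution errors only surface once the host synchronizes,
    // which is why problems appear to "come from" this call.
    cudaError_t syncErr = cudaDeviceSynchronize();

    printf("launch: %s, sync: %s\n",
           cudaGetErrorString(launchErr), cudaGetErrorString(syncErr));
    return 0;
}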

As for the kernel, your shared memory is of size SIZE_X*SIZE_Y*NTHREADS (which in the example translates to 512 elements per block). In your nested loops you index into it using [i*blockDim.x*SIZE_Y + j*blockDim.x + threadIdx.x] -- this is where your problem is.

To be more specific, your i and j values will range from [0, 4), your threadIdx.x from [0, 32), and your SIZE_{X | Y} values are 4. When blockDim.x is 64, your maximum index used in the loop will be 991 (from 3*64*4 + 3*64 + 31). When your blockDim.x is 32, your maximum index will be 511.

Based on your code, you should get errors whenever your NBLOCKS exceeds your NTHREADS.

NOTE: I originally posted this to https://devtalk.nvidia.com/default/topic/527292/cuda-programming-and-performance/cuda-racecheck-shared-memory-array-and-cudadevicesynchronize-/

-1
votes

This was apparently a bug in NVIDIA drivers for Linux. The bug disappeared after the 313.18 release.