CUDA: 2 threads from different warps but same block attempt to write into same SHARED memory position: dangerous?

Question

Will this lead to inconsistencies in shared memory?

My kernel code looks like this (pseudocode):

__shared__ uint histogram[32][64];

uint threadLane = threadIdx.x % 32;

for (data){
     histogram[threadLane][data]++;
}

Will this lead to collisions, given that, in a block with 64 threads, threads with id "x" and "(x + 32)" will very often write into the same position in the matrix?

This program calculates a histogram for a given matrix. I have an analogous CPU program which does the same. The histogram calculated by the GPU is consistently 1/128 lower than the one calculated by the CPU, and I can't figure out why.

Can you please provide some more details, especially about what data is in relation to threadIdx and about the launch configuration? Something that compiles would be better. — Davide Spataro

Davide Spataro · Accepted Answer · 2017-07-12T17:19:52

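Yes, it is dangerous: it leads to a race condition. The increment is a read-modify-write, so when two threads update the same element at the same time, one of the updates can be lost.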

If you cannot guarantee that each thread within a block has exclusive write access to a given location in shared memory, then you have a problem that you need to solve with synchronization.
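One common form of such synchronization is an atomic update. The following is a minimal sketch, not the asker's actual kernel: it assumes 8-bit input values folded into 64 bins, and the names `histogramKernel`, `histogramOnGpu` and `NUM_BINS` are made up for illustration. Error checking on the CUDA calls is omitted for brevity.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define NUM_BINS 64  // assumed bin count, chosen to match the question's second dimension

// Each block builds a private histogram in shared memory and then merges it
// into the global result. atomicAdd makes each read-modify-write indivisible,
// so threads from different warps can safely update the same bin.
__global__ void histogramKernel(const unsigned char *data, int n,
                                unsigned int *globalHist)
{
    __shared__ unsigned int localHist[NUM_BINS];

    // Zero the per-block histogram before anyone counts into it.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        localHist[i] = 0;
    __syncthreads();

    // Grid-stride loop: every input element is counted exactly once.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&localHist[data[i] % NUM_BINS], 1u);
    __syncthreads();

    // Merge this block's counts into the global histogram.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&globalHist[i], localHist[i]);
}

// Hypothetical host helper: copies the input to the device, runs the kernel,
// and copies the NUM_BINS counters back into hostHist.
void histogramOnGpu(const unsigned char *hostData, int n, unsigned int *hostHist)
{
    assert(n > 0);
    unsigned char *dData = nullptr;
    unsigned int *dHist = nullptr;
    cudaMalloc(&dData, n);
    cudaMalloc(&dHist, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(dData, hostData, n, cudaMemcpyHostToDevice);
    cudaMemset(dHist, 0, NUM_BINS * sizeof(unsigned int));

    histogramKernel<<<8, 64>>>(dData, n, dHist);  // launch shape is arbitrary here

    cudaMemcpy(hostHist, dHist, NUM_BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dHist);
}
```

With a block of 64 threads, as in the question, threads `x` and `x + 32` may still target the same bin, but the atomic add serializes those two updates instead of letting one overwrite the other.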

Take a look at this paper for a correct and efficient way of using shared memory for histogram computation: http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/histogram64/doc/histogram.pdf

Note that there are plenty of libraries available that let you compute a histogram in a few lines, Thrust for instance.
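As a rough sketch of the Thrust route: a dense histogram is usually built from a sort, a vectorized `upper_bound` and an `adjacent_difference`, rather than from a single dedicated call. The helper name `thrustHistogram` and the assumption that all values are integers in `[0, numBins)` are mine, not part of the question.

```cuda
#include <cassert>
#include <vector>
#include <thrust/adjacent_difference.h>
#include <thrust/binary_search.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/sort.h>

// Hypothetical helper: dense histogram of integer values assumed to lie
// in [0, numBins). Returns one counter per bin.
std::vector<unsigned int> thrustHistogram(const std::vector<int> &values, int numBins)
{
    assert(numBins > 0);

    // Copy the input to the device and sort it so equal values are adjacent.
    thrust::device_vector<int> d(values.begin(), values.end());
    thrust::sort(d.begin(), d.end());

    // For each bin b, find how many elements are <= b (a cumulative count).
    thrust::device_vector<unsigned int> hist(numBins);
    thrust::counting_iterator<int> firstBin(0);
    thrust::upper_bound(d.begin(), d.end(),
                        firstBin, firstBin + numBins,
                        hist.begin());

    // Differences of neighbouring cumulative counts give the per-bin counts.
    thrust::adjacent_difference(hist.begin(), hist.end(), hist.begin());

    std::vector<unsigned int> out(numBins);
    thrust::copy(hist.begin(), hist.end(), out.begin());
    return out;
}
```

No explicit synchronization appears in this version because no two threads ever increment a shared counter; the cost is a full sort of the input.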
