Will this lead to inconsistencies in shared memory?
My kernel code looks like this (pseudocode):
__shared__ uint histogram[32][64];
uint threadLane = threadIdx.x % 32;
for (data){
histogram[threadLane][data]++;
}
Will this lead to collisions, given that, in a block with 64 threads, threads with id "x" and "(x + 32)" will very often write into the same position in the matrix?
This program calculates a histogram for a given matrix. I have an analogous CPU program which does the same. The histogram calculated by the GPU is consistently 1/128 lower than the one calculated by the CPU, and I can't figure out why.
data
is in relation to threadIdx and about the launch configuration? Something that compiles would be better. – Davide Spataro