
I apologize in advance for the vagueness of this question.

Background:

I am attempting to write a morphological image processing function in OpenCL. I have a __local buffer which I use to store data for every pixel (each pixel is represented by a work-item, no loop unrolling yet). Also, since I am early in testing, I am only using a single work-group (8x8 pixel image so I can manually validate results).

Problem:

There are occasions when data from one, two, three, or even four pixels must be added into the pixel buffer of another. Since these are adjacent pixels in the same work-group, I am sure I am causing local memory bank conflicts. That's okay; speed isn't my top priority (yet!). However, these bank conflicts seem to be dropping data and even corrupting data. I've been very careful not to overflow or overrun the buffers.

So, my first question is: is it, in fact, possible that the bank conflicts are causing data corruption and loss? The OpenCL spec seems to indicate that the operation should serialize, reducing bandwidth, but there is no mention of data loss.

My second question is: Help! - What can I do about this?

Any guidance will be greatly appreciated - thanks!

Just to let you know that I am having the same issue with an NVIDIA card... It seems the implementation of shared memory is very finicky about timing issues, and the compiler is apparently not able to work around them. I also had to remove a branch in the loop and unroll the loop to get things to compute correctly. And it got faster without the branch and some extra dummy computation anyway ;) – Samuel Audet
Can you share the code? How are the adds performed? Are you using atomic ops where necessary? Are you synchronizing memory accesses with barriers? It's doubtful that bank conflicts would cause data corruption; it's more likely to be inappropriate memory access patterns. But we can only guess without the code. – Tim Child
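To illustrate the point raised in the comments: the symptoms described (dropped and corrupted values) are much more consistent with an unsynchronized read-modify-write race than with bank conflicts, which only serialize accesses. Below is a small Python sketch (not OpenCL code; just a deterministic simulation of two work-items) showing how two interleaved load/add/store sequences lose an update, while a serialized read-modify-write, which is what OpenCL's `atomic_add` guarantees on a `__local` cell, does not. The function names and the fixed interleaving schedule are illustrative assumptions, not anything from the question's code.

```python
# Deterministic simulation of two "work-items" both adding into the
# same shared cell. The interleaving is fixed so the race is reproducible.

def racy_add(shared, idx, values):
    # Each work-item does load -> add -> store. Here both load the old
    # value before either stores, so the second store overwrites the
    # first: a classic lost update.
    loaded = [shared[idx], shared[idx]]               # both load the same old value
    results = [loaded[0] + values[0], loaded[1] + values[1]]
    shared[idx] = results[0]                          # first work-item stores
    shared[idx] = results[1]                          # second store clobbers it
    return shared[idx]

def serialized_add(shared, idx, values):
    # Fully serialized read-modify-write, as an atomic add guarantees.
    for v in values:
        shared[idx] = shared[idx] + v
    return shared[idx]

buf = [0]
print(racy_add(buf, 0, [3, 4]))        # 4: the contribution of 3 was lost
buf = [0]
print(serialized_add(buf, 0, [3, 4]))  # 7: both contributions survive
```

In an actual kernel, the same fix would mean using `atomic_add` for the cross-pixel accumulations (or restructuring so each work-item only ever writes its own cell), plus `barrier(CLK_LOCAL_MEM_FENCE)` between the write phase and any subsequent reads.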

1 Answer


Maybe the NVIDIA whitepaper Prefix Sum (Scan) with CUDA can put you on the right track. It is about the all-prefix-sums algorithm, which is a good example of a computation that seems inherently sequential, but for which there is an efficient parallel algorithm.

The all-prefix-sums operation (exclusive scan) turns a list of numbers such as [3,4,1,2] into the running sums of all preceding elements: [0,3,7,8].

I know the paper is about CUDA, but I found that the resulting kernels are very similar, as both technologies use similar concepts.

I hope the paper can help you.

Cheers