OpenCL oclHistogram from Nvidia on AMD HD5830

Question

I'm trying to understand why the oclHistogram example with 256 bins from the Nvidia OpenCL SDK doesn't work on my AMD HD5830. On an Nvidia card such as the GTX 580 it runs without problems (the 64-bin example works on the AMD card, too). More information can be found in the OpenCL whitepaper and the CUDA whitepaper. This example was also discussed here; the last post there is mine, but I haven't received an answer yet.

What I know: Nvidia cards have 16 KB of local memory, while the AMD card has only 8 KB. So the histogram calculation, with one sub-histogram per warp, should fit on both cards:

6 warps (192 threads) * 256 counters * 4 bytes per counter == 6KB
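The arithmetic above can be written out as a small helper. This is only a sketch: the function name `histogram_local_bytes` is made up for illustration and is not part of the Nvidia sample.

```c
#include <stddef.h>

/* Local memory needed when every warp keeps its own sub-histogram.
 * Hypothetical helper, not taken from the oclHistogram sources. */
static size_t histogram_local_bytes(size_t warps, size_t bins,
                                    size_t bytes_per_counter)
{
    return warps * bins * bytes_per_counter;
}
```

For 6 warps, 256 bins and 4-byte counters this gives 6144 bytes (6 KB); for 16 warps it gives 16384 bytes (16 KB).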

On the GTX 580 we could also use 16 warps and 16 sub-histograms, but the sub-histograms must be merged at the end, and that is costly. So using only 6 warps is faster than using 16.

There must be another limit on the AMD HD5830, because the example doesn't work on it.

Try decreasing the workgroup size from 256 to 64; this helped me run the nBody sample. — Dmitriy

Rick-Rainer Ludwig · Accepted Answer · 2011-06-27T10:21:10

In an introduction to OpenCL (I do not remember which one :-() it was recommended not to plan on using the whole local memory. Some additional data that is outside the programmer's control might be placed there: the compiler might use local memory for optimizations, and other kernels might allocate memory, too. The general suggestion was not to use more than half of the memory. Maybe that is what is happening here.
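The "half of the memory" suggestion can be checked against the numbers from the question. The sketch below is plain C with a hypothetical helper name, and the 50% threshold is only the rule of thumb mentioned above, not a documented limit. In a real host program the device size would come from clGetDeviceInfo with CL_DEVICE_LOCAL_MEM_SIZE, and the amount a compiled kernel actually uses from clGetKernelWorkGroupInfo with CL_KERNEL_LOCAL_MEM_SIZE.

```c
#include <stdbool.h>
#include <stddef.h>

/* Rule of thumb only (assumption, not a documented limit):
 * plan to use at most half of the device's local memory. */
static bool fits_with_headroom(size_t requested_bytes,
                               size_t device_local_bytes)
{
    return requested_bytes <= device_local_bytes / 2;
}
```

With the figures from the question, a 6 KB request fails this check on an 8 KB device (the limit would be 4 KB) but passes on a 16 KB device (the limit would be 8 KB), which would be consistent with the behaviour described.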
