I'm trying to understand why the oclHistogram example with 256 bins from the Nvidia OpenCL SDK doesn't work on my AMD HD5830. On an Nvidia card such as the GTX 580 it runs without problems (the 64-bin example works on the AMD card, too). More information can be found in the OpenCL whitepaper and the CUDA whitepaper. This example was also discussed here; the last post there is mine, but I haven't received an answer yet.
What I know: Nvidia cards have 16 KB of local memory, while the AMD card has only 8 KB. So the histogram calculation should fit on both cards:
6 warps (192 threads) * 256 counters * 4 bytes per counter = 6 KB
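To double-check the arithmetic above, here is a minimal sketch in Python. The function names are my own and hypothetical, not taken from the SDK sample, and the 16 KB and 8 KB limits are simply the figures quoted above, not values queried from a device.

```python
# Back-of-the-envelope check of the local-memory footprint described above.
# All names here are illustrative; nothing is taken from the SDK sample.

BYTES_PER_COUNTER = 4   # one 32-bit counter per bin
BINS = 256              # histogram bin count


def sub_histogram_bytes(num_sub_histograms, bins=BINS,
                        bytes_per_counter=BYTES_PER_COUNTER):
    """Local memory needed when each warp owns one private sub-histogram."""
    return num_sub_histograms * bins * bytes_per_counter


def fits(num_sub_histograms, local_mem_bytes):
    """True if all sub-histograms fit in the given local memory size."""
    return sub_histogram_bytes(num_sub_histograms) <= local_mem_bytes


if __name__ == "__main__":
    # Limits are the figures assumed in the post, in KB.
    for limit_kb in (16, 8):
        for warps in (6, 16):
            need = sub_histogram_bytes(warps)
            ok = fits(warps, limit_kb * 1024)
            print(f"{warps} sub-histograms: {need} bytes, "
                  f"fits in {limit_kb} KB: {ok}")
```

With 6 sub-histograms the footprint is 6144 bytes, which is below both assumed limits, so local memory size alone does not seem to explain the failure.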
On the GTX 580 we could also use 16 warps and 16 sub-histograms, but the sub-histograms must be merged at the end, which is costly. So using only 6 warps is faster than using 16.
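To illustrate why the merge gets more expensive with more sub-histograms, here is a sequential Python model of the scheme. It is only a sketch and not GPU code: the warp width of 32 and the way elements are assigned to warps are assumptions for illustration, not how the SDK kernel distributes work.

```python
# Sequential model of per-warp sub-histograms followed by a merge step.
# Plain Python for illustration only; this is not GPU code.

BINS = 256
WARP_SIZE = 32  # assumption: Nvidia warp width


def histogram_with_sub_histograms(data, num_warps):
    """Count byte values using one private sub-histogram per simulated warp."""
    subs = [[0] * BINS for _ in range(num_warps)]
    for i, value in enumerate(data):
        # Hypothetical assignment of element i to a simulated warp.
        warp = (i // WARP_SIZE) % num_warps
        subs[warp][value] += 1

    # Merge step: each bin is summed across all sub-histograms,
    # so the merge touches num_warps * BINS counters.
    merged = [sum(sub[b] for sub in subs) for b in range(BINS)]
    merge_reads = num_warps * BINS
    return merged, merge_reads


if __name__ == "__main__":
    data = bytes(range(256)) * 4
    for warps in (6, 16):
        hist, reads = histogram_with_sub_histograms(data, warps)
        print(f"{warps} warps: merge reads {reads} counters, "
              f"total count {sum(hist)}")
```

The final histogram is the same for either warp count; only the number of counters the merge has to read changes (6 * 256 versus 16 * 256).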
There must be some other limit on the AMD HD5830 that I'm missing, because the 256-bin example doesn't work on it.