0
votes

I am a beginner to OpenCL. I am implementing an algorithm on AMD 8670M(GCN Architecture) device. I am using OpenCL local memory to store frequently accessed global data. According to the device specificatons there are :

a) 5 compute units each having 64 KB of local memory.So device as a whole has 320 KB.
b) Maximum 2560 work-items on a compute unit. 

I launched a kernel with 8 work-groups,each work-group having 256 work-items.Each work-group utilizes 16 KB of local memory. So the kernel uses :

a) 2048 work-items  
b) 128 KB local memory

2048 work-items fit on a single compute unit but a compute unit provides only 64 KB local memory.So,two compute units are required to provide required local memory.
According to my understanding now there can be two ways of kernel launching

1) Work-groups are distributed to two compute units to provide required local memory.
2) Work-groups are assigned to only one compute unit and excess local memory is spilled out to global memory.

Which of the above cases are likely to occur? Is there any way of checking number of active wave-fronts on each compute unit? Any suggestions are appreciated.Thanks in advance.

1
Local memory cannot spill to global memory. Instead, the number of concurrently running work groups will be limited.void_ptr

1 Answers

2
votes

Work groups do not have to be concurrent. Nor do they have to be on a single compute unit. Since you can fit 4 work groups only on a single compute unit you are guaranteed to not have all of them on the same compute unit at the same time (there will not be any spill, that would defeat the purpose of local memory).

Now the system is still free to start your 8 WGs on the 5 CUs or even on a single CU but one after the other. The only scheduling guarantee is that each 256 bundle of work items will be scheduled together. It is up to the system to pick something that is most efficient.

And here comes the kicker. You're running on a system that can run up to 12k work items concurrently. You're only providing it 2k work items. So the system may not end up working very efficiently since you're far from filling the machine. In particular you typically want multiple WGs per CU to help hide the latencies of starting and stopping them.