12
votes

The OpenCL standard defines the following queries for getting information about a device and a compiled kernel:

  • CL_DEVICE_MAX_COMPUTE_UNITS

  • CL_DEVICE_MAX_WORK_GROUP_SIZE

  • CL_KERNEL_WORK_GROUP_SIZE

  • CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE

Given these values, how can I calculate the optimal work group size and number of work groups?

2 Answers

8
votes

You discover these values experimentally for your algorithm. Use a profiler to get hard numbers.

I like to use CL_DEVICE_MAX_COMPUTE_UNITS as the number of work groups, because I often rely on synchronizing work items. I usually run kernels with little branching, so they take about the same time to execute on each compute unit.
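As a rough illustration of the "one work group per compute unit" choice, here is a minimal sketch of the arithmetic involved. The function `launch_shape` and its parameter names are made up for this example; in real host code `compute_units` would come from `clGetDeviceInfo` with `CL_DEVICE_MAX_COMPUTE_UNITS`, and the kernel would have to loop over several items per work item.

```python
def launch_shape(compute_units, local_size, n_items):
    """Return (global_size, items_per_work_item) for one work group per compute unit.

    All three arguments are plain integers here; querying them from the
    OpenCL runtime is left out of this sketch.
    """
    if compute_units < 1 or local_size < 1 or n_items < 0:
        raise ValueError("compute_units and local_size must be >= 1, n_items >= 0")
    # Number of work groups == number of compute units.
    global_size = compute_units * local_size
    # Ceiling division: how many items each work item must process in a loop.
    items_per_work_item = -(-n_items // global_size)
    return global_size, items_per_work_item


# Example values (assumed, not measured): 16 compute units, 256 work items per group.
global_size, per_item = launch_shape(compute_units=16, local_size=256, n_items=1_000_000)
print(global_size, per_item)  # 4096 245
```

With a fixed number of groups, the input size only changes how many items each work item loops over, not the launch dimensions.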

Some multiple of CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE will be optimal for your device. What that multiple actually is depends on your memory access pattern and the type of work each work item does. Use 1 as the multiple when you are running a heavy, compute-bound (ALU) kernel. Try a larger multiple to hide memory latency if you are bottlenecked by memory access. Use a profiler to determine when your access time and your ALU time are optimal.
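To make the "some multiple" idea concrete, the sketch below lists the work group sizes worth profiling. It assumes the two inputs were already obtained from `clGetKernelWorkGroupInfo` (`CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE` and `CL_KERNEL_WORK_GROUP_SIZE`); the helper names are hypothetical. The second helper pads the global size, which is needed on OpenCL 1.x where the global size must be a multiple of the work group size.

```python
def candidate_local_sizes(preferred_multiple, kernel_max_size):
    """List every multiple of the preferred multiple that the kernel can run with."""
    sizes = list(range(preferred_multiple, kernel_max_size + 1, preferred_multiple))
    # Fallback (an assumption of this sketch): if even one multiple is too big,
    # offer the kernel's own maximum instead of an empty list.
    return sizes or [kernel_max_size]


def round_up_global(n_items, local_size):
    """Smallest multiple of local_size that is >= n_items."""
    return -(-n_items // local_size) * local_size


# Example values only: preferred multiple 64, kernel limit 256.
print(candidate_local_sizes(64, 256))  # [64, 128, 192, 256]
print(round_up_global(1000, 64))       # 1024
```

When the global size is padded like this, the kernel needs a bounds check so the extra work items do nothing.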

The ideal ALU-to-fetch ratio is 1:1 on any device. This is rarely achieved in practice, so the priority is to keep the ALU/SIMD banks saturated, which means ALU:fetch should be greater than 1 whenever possible. A ratio below 1 means you should try a larger work group size to better hide the memory latency.
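The rule of thumb above can be written as a small decision step in a tuning loop. This is only a sketch: `suggest_next_multiple` is a hypothetical helper, the ratio is whatever your profiler reports, and doubling the multiple is an arbitrary choice made for this example rather than anything the profiler or OpenCL prescribes.

```python
def suggest_next_multiple(alu_to_fetch, current_multiple, max_multiple):
    """Pick the next work group size multiple to try, given a profiled ALU:fetch ratio."""
    if alu_to_fetch < 1.0 and current_multiple < max_multiple:
        # Memory-bound: try a larger work group to hide latency (doubling is assumed).
        return min(current_multiple * 2, max_multiple)
    # ALU-bound, or already at the limit: keep the current multiple.
    return current_multiple


print(suggest_next_multiple(0.5, 1, 4))  # 2
print(suggest_next_multiple(1.5, 2, 4))  # 2
```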

0
votes

As mfa said, you have to discover these experimentally. I wanted to add that, depending on what you are computing (particularly the size of the job each work item handles), it can be worth trying:

  • Many work items in small work groups, with each work item doing a small job.
  • Fewer work items in larger work groups, with each work item doing a larger job.

That is, check these base cases and see how each one affects the processing pipeline.

In essence, you have to tune it. I often run the kernel several times with different parameters, profile each run, and then generate a surface plot to see how it behaves.
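A parameter sweep like that can be sketched as follows. Everything here is illustrative: `sweep` is a hypothetical helper, and `fake_measure` is a made-up cost function standing in for a real timed kernel launch, so the numbers mean nothing beyond showing the shape of the loop.

```python
import itertools


def sweep(measure, local_sizes, group_counts):
    """Call measure(local_size, num_groups) for every combination.

    Returns (results, best), where results maps (local_size, num_groups)
    to the measured time and best is the key with the lowest time.
    """
    results = {}
    for local_size, num_groups in itertools.product(local_sizes, group_counts):
        results[(local_size, num_groups)] = measure(local_size, num_groups)
    best = min(results, key=results.get)
    return results, best


def fake_measure(local_size, num_groups):
    # Stand-in for a profiled run; replace with real timings from your kernel.
    return abs(local_size - 128) / 128 + abs(num_groups - 16) / 16


results, best = sweep(fake_measure, [64, 128, 256], [8, 16, 32])
print(best)  # (128, 16)
```

The `results` dictionary is already a grid over the two parameters, so it can be fed to any plotting tool to draw the surface.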