Optimize metal compute shader for image histogram

Question

I have a metal shader that computes an image histogram like this:

#define CHANNEL_SIZE (256)
typedef atomic_uint HistoBuffer[CHANNEL_SIZE];

kernel void
computeHisto(texture2d<half, access::read> sourceTexture [[ texture(0) ]],
             device HistoBuffer &histo [[buffer(0)]],
             uint2  grid [[thread_position_in_grid]]) {
  if (grid.x >= sourceTexture.get_width() || grid.y >= sourceTexture.get_height()) { return; }

  half gray  = sourceTexture.read(grid).r;
  uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
  atomic_fetch_add_explicit(&histo[grayvalue], 1, memory_order_relaxed);
}

This works as expected but takes too long (>1ms). I now tried to optimise this by reducing the number of atomic operations. I came up with the following improved code. The idea is to compute local histograms per thread group and add them later atomically into the global hist buffer.

kernel void
computeHisto_fast(texture2d<half, access::read> sourceTexture [[ texture(0) ]],
             device HistoBuffer &histo [[buffer(0)]],
             uint2 t_pos_grid [[thread_position_in_grid]],
             uint2 tg_pos_grid [[ threadgroup_position_in_grid ]],
             uint2 t_pos_tg [[ thread_position_in_threadgroup]],
             uint  t_idx_tg [[ thread_index_in_threadgroup ]],
             uint2 t_per_tg [[ threads_per_threadgroup ]]
             )
{
  threadgroup uint localhisto[CHANNEL_SIZE] = { 0 };

  if (t_pos_grid.x >= sourceTexture.get_width() || t_pos_grid.y >= sourceTexture.get_height()) { return; }

  half gray  = sourceTexture.read(t_pos_grid).r;
  uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
  localhisto[grayvalue]++;

  // wait for all threads in threadgroup to finish
  threadgroup_barrier(mem_flags::mem_none);

  // copy the thread group result atomically into global histo buffer
  if(t_idx_tg == 0) {
    for(uint i=0;i<CHANNEL_SIZE;i++) {
      atomic_fetch_add_explicit(&histo[i], localhisto[i], memory_order_relaxed);
    }
  }
}

There are 2 problems:

The improved routine does not yield identical results compared to the first and I currently don't see why ?
The run time didn't improve. in fact it takes 4 times the runtime of the unoptimised version. According to the debugger the for loop is the problem. But I do not understand this, since the number of atomic operation is reduced by 3 orders of magnitude, i.e. the thread group size, here (32x32)=1024.

Anbody who can explain what I am doing wrong here ? Thanks

EDIT: 2019-12-22: According to Matthijs answer I have changed the local histogram also to atomic operations like this:

threadgroup atomic_uint localhisto[CHANNEL_SIZE] = {0};

half gray  = sourceTexture.read(t_pos_grid).r;
uint grayvalue = uint(gray * (CHANNEL_SIZE - 1));
atomic_fetch_add_explicit(&localhisto[grayvalue], 1, memory_order_relaxed);

However the result sill is not the same as in the reference implementation above. There must be another severe conceptional bug ???

Matthijs Hollemans Matthijs Hollemans · Accepted Answer · 2019-12-22T12:58:24

You'll still need to use atomic operations on the threadgroup memory, since it's still being shared by multiple threads. This should be faster than in your first version because there is less contention for the same locks.

Optimize metal compute shader for image histogram

1 Answers