How are global and local work sizes distributed in this function?

Question

This is from a sample program for OpenCL programming. I am confused about how global and local work size are computed. They are computed based on the image size.

Image size is 1920 x 1080 (w x h).

What I assumed is global_work_size[0] and global_work_size[1] are grids on image.

But now global_work_size is {128, 1088}.

Then local_work_size[0] and local_work_size[1] are grids on global_work_size. local_work_size is {128, 32}.

But total groups, num_groups = 34, it is not 128 x 1088.

Max workgroup_size available at device is 4096.

How is the image distributed into such global and local work group sizes?

They are calculated in the following function.

    clGetKernelWorkGroupInfo(histogram_rgba_unorm8, device, CL_KERNEL_WORK_GROUP_SIZE, sizeof(size_t), &workgroup_size, NULL);
    {
        size_t  gsize[2];
        int     w;

        if (workgroup_size <= 256)
        {
            gsize[0] = 16;//workgroup_size is formed into row & col
            gsize[1] = workgroup_size / 16;
        }
        else if (workgroup_size <= 1024)
        {
            gsize[0] = workgroup_size / 16;
            gsize[1] = 16;
        }
        else
        {
            gsize[0] = workgroup_size / 32;
            gsize[1] = 32;
        }

        local_work_size[0] = gsize[0];
        local_work_size[1] = gsize[1];

        w = (image_width + num_pixels_per_work_item - 1) / num_pixels_per_work_item;//to include all pixels, num_pixels_per_work_item is added first
        global_work_size[0] = ((w + gsize[0] - 1) / gsize[0]);//col
        global_work_size[1] = ((image_height + gsize[1] - 1) / gsize[1]);//row

        num_groups = global_work_size[0] * global_work_size[1];    
        global_work_size[0] *= gsize[0];
        global_work_size[1] *= gsize[1];
    }    
    err = clEnqueueNDRangeKernel(queue, histogram_rgba_unorm8, 2, NULL, global_work_size, local_work_size, 0, NULL, NULL);
    if (err)
    {
        printf("clEnqueueNDRangeKernel() failed for histogram_rgba_unorm8 kernel. (%d)\n", err);
        return EXIT_FAILURE;
    }

pmdj pmdj · Accepted Answer · 2020-06-13T12:43:07

I don't see any great mystery here. If you follow the calculation, the values do indeed end up as you say. (Not that the group size is particularly efficient in my opinion.)

If workgroup_size is indeed 4096, gsize will end up as { 128, 32 } as it follows the else logic. (>1024)
w is the number of num_pixels_per_work_item = 32 wide columns, or the minimum number of work-items to cover the entire width, which for an image width of 1920 is 60. In other words, we require an absolute minimum of 60 x 1080 work-items to cover the entire image.
Next, the number of group columns and rows is calculated and temporarily stored in global_work_size. As group width has been set to 128, a w of 60 means we end up with 1 column of groups. (This seems a waste of resources, more than half of the 128 work-items in each group will not be doing anything.) The number of group rows is simply image_height divided by gsize[1] (32) and rounding up. (33.75 -> 34)
Total number of groups can now be determined by multiplying out the grid: num_groups = global_work_size[0] * global_work_size[1]
To get the true total number of work-items in each dimension, each dimension of global_work_size is now multiplied by the group size in this dimension. 1, 34 multiplied by 128, 32 yields 128, 1088.

This actually covers an area of 4096 x 1088 pixels so about 53% of that is wastage. This is mainly because the algorithm for group dimensions favours wide groups, and each work-item works on a 32x1 pixel slice of the image. It would be better to favour tall work groups to reduce the amount of rounding.

For example, if we reverse gsize[0] and gsize[1], in this case we'd get a group size of { 32, 128 }, giving us a global work size of { 64, 1152 } and only 12% wastage. It would also be worth checking if always picking the largest possible group size is even a good idea; it quite possibly isn't, but I've not looked into the kernel's computation in detail, let alone run any measurements, to say if that's the case or not.

How are global and local work sizes distributed in this function?

1 Answers