2 votes

I have the following compute shader:

#version 450

layout (local_size_x = 128, local_size_y = 1, local_size_z = 1) in;

layout(push_constant) uniform PushConstant
{
    vec2 topLeft;
    vec2 bottomRight;
};

struct Position {
  float x, y, z;
};

layout (set=0, binding=0) buffer PositionBuffer
{
    Position positions[];
};

layout (set=0, binding=1) buffer SelectionBuffer
{
    uint selected[];
};

void main()
{
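    // Flatten the 3D global invocation ID into a linear index.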
    uint ind = gl_GlobalInvocationID.z * (gl_WorkGroupSize.x * gl_NumWorkGroups.x) * (gl_WorkGroupSize.y * gl_NumWorkGroups.y)
               + gl_GlobalInvocationID.y * (gl_WorkGroupSize.x * gl_NumWorkGroups.x)
               + gl_GlobalInvocationID.x;

    Position pos = positions[ind];

    selected[ind] = 0;

    if(pos.x > topLeft.x && pos.x < bottomRight.x && pos.y > topLeft.y && pos.y < bottomRight.y)
    {
        selected[ind] = 1;
    }
}

What it does is check whether a point (from the positions buffer) is inside a user-provided rectangle (from a push constant). If it is, the shader marks the point by writing 1 to the selected buffer.

This code works fine. But since I'm not experienced with compute shaders, I'm looking for ways to make it better. I know there are shared variables which are accessible by the entire workgroup. The idea is to make an array of shared positions and fill it in one thread, say thread number 0. Then, in theory, the other threads would not need to read buffer memory, but the faster shared memory instead.
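
Something like this is what I have in mind (an untested sketch of the idea, assuming the same bindings as above and a 1D dispatch):

shared Position sharedPositions[gl_WorkGroupSize.x];

void main()
{
    uint ind = gl_GlobalInvocationID.x;

    // Thread 0 copies this workgroup's slice of the buffer into shared memory.
    if (gl_LocalInvocationIndex == 0u)
    {
        for (uint i = 0u; i < gl_WorkGroupSize.x; ++i)
        {
            sharedPositions[i] = positions[gl_WorkGroupID.x * gl_WorkGroupSize.x + i];
        }
    }

    barrier(); // wait until the copy is finished before anyone reads it

    Position pos = sharedPositions[gl_LocalInvocationIndex];
    // ... same rectangle test as before ...
}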

Is it worth it?
How do I synchronize properly?
Can I do something similar for writing data to the selected array?

You could massively simplify the address calculation here. You want a 1D address in a list, so why not just use a 1D work space and avoid all of the maths needed to compute ind? On some GPU architectures you might also get some benefit from using the any() or all() built-in functions to implement vector comparisons, rather than the long scalar if statement you have at the moment. Also, your positions are vec3 types but you are only using x and y. Given this is likely memory-bandwidth limited, getting rid of z from the buffer would help. – solidpixel
Thank you. I simplified the condition check. But can you give me a hint on how to simplify the index computation? Indeed, I need a 1D address, but since the number of work groups is a 3D value, I have no idea how to shorten it. – nikitablack
The application controls the workgroup dispatch size. Just set up the dispatch so that the global y and z dimensions are 1. – solidpixel
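
For reference: with a 1D dispatch on the application side, as the comment above suggests (e.g. vkCmdDispatch(cmd, groupCount, 1, 1), where groupCount covers the whole positions array), the index computation in the shader collapses to a single line:

uint ind = gl_GlobalInvocationID.x;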

1 Answer

3 votes

Look at it from the perspective of your total operation. In order, you:

  1. Read a single, contiguous block of memory.
  2. Perform a single operation on each value of that memory.
  3. Write the result of that operation into another block of memory.

At no time does your code need to read a value more than once. And while the code as written does potentially write a value twice, there's no reason it has to. You could just as easily compute a value based on the condition and then write that value to memory. I would assume a good compiler will translate your code into precisely that.
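
For illustration, a branch-free version of the shader body along those lines, also using the vector comparisons suggested in the comments (an untested sketch):

Position pos = positions[ind];
vec2 p = vec2(pos.x, pos.y);

// One unconditional write per invocation; uint(bool) yields 0 or 1.
selected[ind] = uint(all(greaterThan(p, topLeft)) && all(lessThan(p, bottomRight)));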

Because no invocation is reading from or writing to more than one location at once, cached access to the memory only helps in that it turns a "read X bytes" request into a more efficient "read one cache line" request. Two invocations that try to read from addresses which happen to live in the same cache line should only trigger a single memory fetch. The same goes for writing; multiple invocations writing to the same cache line should be aggregated into a single write.

This assumes reasonable hardware, of course.

It remains hypothetically possible for such a system to issue multiple reads/writes of the same memory. That has to do with the number of invocations in a warp/wavefront (i.e., the set of invocations of a shader that execute in lock-step). If the size of the data being read per warp is not cache-line aligned, it is possible for two warps to issue a read to the same cache line, since different warps are potentially executed at the same time. The same goes for writes. But even this assumes that the cache, and the decision to perform a memory fetch, operate on a per-warp basis.

Regardless, if this is determined to be the case, the proper solution is to align your reads as best you can, not to try to do the cache's job for it.

There are times when pre-caching data in shared memory would be useful, but mainly in cases where invocations frequently read from the same addresses, and usually when they read from each other's data. Even then, that's something you should profile, rather than trying to code that way a priori.
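
If you ever do hit such a case (say, each invocation needs its neighbours' positions, not just its own), the usual pattern is a cooperative load followed by a barrier: every invocation loads one element into shared memory, barrier() synchronizes the workgroup (and, in compute shaders, also orders accesses to shared variables), and only then does anyone read its neighbours' entries. A minimal sketch, assuming a 1D dispatch; the tile name and the neighbour access are made up for illustration:

shared vec2 tile[gl_WorkGroupSize.x];

void main()
{
    uint ind = gl_GlobalInvocationID.x;

    // Cooperative load: each invocation fetches exactly one element.
    Position pos = positions[ind];
    tile[gl_LocalInvocationIndex] = vec2(pos.x, pos.y);

    // Execution and shared-memory synchronization point for the workgroup.
    barrier();

    // Every invocation can now read any other invocation's entry, e.g.:
    uint neighbour = (gl_LocalInvocationIndex + 1u) % gl_WorkGroupSize.x;
    vec2 other = tile[neighbour];
    // ... use `other` ...
}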