Metal fragment shader A-Buffer produces shimmering glitch

Question

I'm implementing an A-Buffer in Metal for Mac, and it is almost working -- except that I am seeing shimmering glitches wherever triangles overlap. It seems like the buffers involved may not be updating at the correct times. But I don't know what could cause it. Here's a picture -- the 'corrupted' area changes every frame but is always where the two colors overlap.

I won't explain the whole A-Buffer operation, but it involves binding three buffers to the shader: one is very large (172MB, although only a small part of it is written to for this example). There is also a "texture" of integers and a single integer atomic counter.

The rendering is done in two passes -- the first pass creates a linked-list of pixel fragments for every visible rendered pixel location:

// the uint return goes into the start index buffer, our 'image'.  The FragLinkBuffer stores the data

fragment uint stroke_abuffer_fragment(VertexIn interpolated [[stage_in]],
                                                const device uint&  color [[ buffer(0) ]],
                                                device FragLink*  LinkBuffer [[ buffer(1) ]],
                                                device atomic_uint &counter[[buffer(2)]],
                                                texture2d<uint> StartTexture     [[ texture(0) ]]) {
    constexpr sampler Sampler(coord::pixel,filter::nearest);

    // reserve the next free slot in the link buffer
    uint value = atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);

    // get old start position for this pixel from the start texture
    int oldStart = StartTexture.sample(Sampler, interpolated.position.xy).x;

    // store fragment information in link buffer
    FragLink F;
    F.color = color;
    F.depth = interpolated.position.z;
    F.next = oldStart;
    LinkBuffer[value] = F;

    // return pointer to new start for this fragment, which will be stored back to the StartTexture
    return value;  
}

The second pass sorts and blends the fragments at each pixel.

#define MAX_PIXELS 16

fragment float4 stroke_abuffer_fragment_composite(CompositeVertexOut interpolated [[stage_in]],
                                                  device FragLink*  LinkBuffer [[ buffer(0) ]],
                                                  texture2d<uint> StartTexture     [[ texture(0) ]]) {
    pixel SortedPixels[MAX_PIXELS];
    int numPixels = 0;
    constexpr sampler Sampler(coord::pixel,filter::nearest);
    FragLink F;
    pixel P;

    uint index = StartTexture.sample(Sampler, interpolated.position.xy).x;
    if (index == 0)
        discard_fragment();

    float4 finalColor = float4(0.0);

    // grab all the linked fragments for this pixel
    while (index != 0) {
        F = LinkBuffer[index];
        P.color = F.color;
        P.depth = F.depth;
        SortedPixels[numPixels++] = P;
        index = (numPixels >= MAX_PIXELS) ? 0 : F.next;
    }

    // now sort them by depth
    for (int j = 1; j < numPixels; ++j) {
        pixel key = SortedPixels[j];
        int i = j - 1;
        while (i >= 0 && SortedPixels[i].depth <= key.depth)
        {
            SortedPixels[i+1] = SortedPixels[i];
            --i;
        }
        SortedPixels[i+1] = key;
    }

    // blend them in order
    for (int k = 0; k < numPixels; k++) {
        uint color = SortedPixels[k].color;
        float red = ((color>>24)&255)/255.0;
        float green = ((color>>16)&255)/255.0;
        float blue = ((color>>8)&255)/255.0;
        float alpha = ((color)&255)/255.0;
        //red = 1.0; green = 0.0; blue = 0.0; alpha = 0.25;
        finalColor.xyz = mix(finalColor.xyz, float3(red,green,blue), alpha);
        finalColor.w = alpha;
    }


    return finalColor;

}

I'm just wondering what might be the cause of this behavior. If I check the values of the buffers at each frame, by blitting their contents back to CPU memory and printing values, they are changing every frame, when they should be the same.

The results are the same whether or not I call commandBuffer.waitUntilCompleted() after each frame's call to commandBuffer.commit(). By calling waitUntilCompleted, shouldn't I eliminate any issues relating to one frame's use of the buffer while the next frame is also trying to access it? (Because I thought perhaps I would need to triple buffer that 172MB buffer which would be horrible.)

I'm doing the entire render -- the initial blit to reset the counter, the first rendering pass, and then the second rendering pass, all as one commandBuffer call. Would that be a problem? In other words, do I need to actually commit the first rendering pass, wait for it to complete, and then initiate the second? (EDIT: I tried this and it did not change anything)

The original technique I am porting (https://www.slideshare.net/hgruen/oit-and-indirect-illumination-using-dx11-linked-lists) does not use OpenGL blending in the second stage -- they bind the background as a texture buffer and blend it in manually along with the pixel fragments, then return the complete result. I just decided to skip this and blend my final combined fragment color with the background using normal 'over' blending. But I don't see why this would cause the problem I'm having. I will try it their way just in case...

I greatly appreciate any ideas about what would cause this! Thanks

. . .

UPDATE: Following the conversation in the comments I've updated the shaders to use an atomic buffer instead of a texture, but am now getting "Execution of the command buffer was aborted due to an error during execution. Internal Error (IOAF code 1)":

fragment void stroke_abuffer_fragment(VertexIn interpolated [[stage_in]],
                                      const device uint&  color [[ buffer(0) ]],
                                      constant Uniforms&  uniforms    [[ buffer(1) ]],
                                      device FragLink*  LinkBuffer [[ buffer(3) ]],
                                      device atomic_uint &counter[[buffer(2)]],
                                      device atomic_uint *StartBuffer[[buffer(4)]]
                                      ) {

    uint pos = int(interpolated.position.x)+int(interpolated.position.y)*uniforms.displaySize[0];

    // get counter value -- the index to next spot in link buffer
    uint value = atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    value += 1;

    // store fragment information in link buffer
    FragLink F;
    F.color = color;
    F.depth = interpolated.position.z;
    F.next = atomic_exchange_explicit(&StartBuffer[pos], value, memory_order_relaxed);

    LinkBuffer[value] = F;
}

Ken Thomases · Accepted Answer · 2017-11-11T20:41:24

For the stroke_abuffer_fragment pass, are you using the same texture for the render target and the StartTexture parameter? I don't think that's kosher. I would hope that the validation layer would complain about that, but maybe it doesn't.

Probably, StartTexture should use access::read_write and the function should write the result to it and return void. In that case, there should be no render targets for the render command encoder.

You also need to declare it with the raster_order_group(0) qualifier to ensure that only one invocation of the fragment function for that pixel will run at a time.

You may need to call StartTexture.fence() after writing to it. I'm not certain about this because the next read of that same texel will be in a subsequent invocation of the fragment function (thanks to raster_order_group()). In other words, raster_order_group() seems to imply a fence, itself.

You'd also need to call textureBarrier on the command encoder after the draw call for that pass. That's necessary to ensure that the next pass sees the results written by the first pass. Other than that, though, it should be fine to do all of this in a single command buffer.

Update:

If you can't use raster_order_group() because you're targeting OS versions before High Sierra, there's an alternative. In fact, it might be superior even if you can because it doesn't require the synchronization implied by raster_order_group().

The basic idea is to use atomic exchange to manipulate the linked list.

So, you would have to change StartTexture to a buffer rather than a texture (as you mentioned trying in your first comment). Yes, you'd need to pass in the width as a "uniform" and compute the element index as you indicated (x + y * width). You wouldn't keep using read(); buffers don't have member functions like that. They are just references or, for this case, pointers. You just index into it like StartTexture[index].
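As a rough illustration of that index computation, here is a CPU-side sketch in plain C++ (not Metal code). The helper name `pixelIndex` and its parameters are assumptions made for this example; in the shader the width would come from whatever uniform you pass in.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: maps an integer pixel coordinate to a flat,
// row-major buffer index, i.e. x + y * width.
inline std::uint32_t pixelIndex(std::uint32_t x, std::uint32_t y, std::uint32_t width) {
    assert(x < width);  // a coordinate past the row end would alias the next row
    return x + y * width;
}
```

The buffer then needs width * height elements, one per pixel.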

The thing is, though, that you'd make the element type atomic_uint instead of uint. And you would use atomic exchange instead of normal reading or writing of StartTexture to integrate the new node into the linked list:

F.next = atomic_exchange_explicit(&StartTexture[index], value, memory_order_relaxed);

This maintains the integrity of the linked list even if two invocations of stroke_abuffer_fragment() are running for a given position at the same time.
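To see why the exchange keeps the list intact, here is a CPU-side model in C++ using std::atomic and std::thread. It is only a sketch of the same idea, not Metal code: the `Node` type, the function name `pushConcurrently`, and the use of threads to stand in for concurrent fragment invocations are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical node type for this sketch; index 0 is the end-of-list sentinel.
struct Node {
    std::uint32_t owner;  // which thread created the node
    std::uint32_t next;   // index of the next node, or 0
};

// Each of `threadCount` threads pushes `perThread` nodes onto one shared list.
// Returns how many nodes are reachable from the head afterwards.
std::uint32_t pushConcurrently(std::uint32_t threadCount, std::uint32_t perThread) {
    const std::uint32_t total = threadCount * perThread;
    std::vector<Node> nodes(total + 1);     // slot 0 is never used
    std::atomic<std::uint32_t> counter{0};  // plays the role of the shader's counter
    std::atomic<std::uint32_t> head{0};     // plays the role of one start-buffer element

    auto worker = [&](std::uint32_t id) {
        for (std::uint32_t i = 0; i < perThread; ++i) {
            // Reserve a unique slot; the +1 keeps 0 free as the sentinel.
            const std::uint32_t slot =
                counter.fetch_add(1, std::memory_order_relaxed) + 1;
            nodes[slot].owner = id;
            // Swap this node in as the new head and link it to the previous head.
            nodes[slot].next = head.exchange(slot, std::memory_order_relaxed);
        }
    };

    std::vector<std::thread> pool;
    for (std::uint32_t t = 0; t < threadCount; ++t) pool.emplace_back(worker, t);
    for (auto& th : pool) th.join();  // joining makes every write visible below

    // Walk the list from the head until the sentinel is reached.
    std::uint32_t reachable = 0;
    for (std::uint32_t i = head.load(); i != 0; i = nodes[i].next) ++reachable;
    return reachable;
}
```

Because every exchange returns a distinct previous head, each node is linked exactly once and no node is lost, whatever order the threads run in. The list is only walked after all writers have finished, which matches reading it in a later pass.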

Another thing: what are you initializing the counter buffer to? And what is StartBuffer cleared to? It seems like you're using a 0 value as a sentinel for end-of-list, so I'm guessing you reset both to all zeros. That makes sense, but remember that atomic_fetch_add_explicit() returns the value of the counter as it was before being incremented. So the first invocation of stroke_abuffer_fragment() will get 0. If you want the value after incrementing, you would, of course, add 1. If you don't want to waste an element in LinkBuffer, you can subtract that 1 back out when indexing into it. Or you could choose a different sentinel value and clear things appropriately. One way or another, you need to fix the mismatch.
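The off-by-one can be sketched on the CPU with std::atomic in C++. This is a model under stated assumptions, not the shader itself: the `Link` type and the function names `firstTicket` and `insertThenWalk` are hypothetical, and a single atomic stands in for one pixel's start-buffer element. It uses the "add 1, then subtract it back out when indexing" variant, so 0 stays available as the sentinel and no link slot is wasted.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical link record; next == 0 means end of list.
struct Link {
    std::uint32_t color;
    std::uint32_t next;
};

// What a freshly zeroed counter hands to the first caller.
std::uint32_t firstTicket() {
    std::atomic<std::uint32_t> counter{0};
    return counter.fetch_add(1, std::memory_order_relaxed);  // the previous value
}

// Inserts the colors one by one, then walks the list from the head.
std::vector<std::uint32_t> insertThenWalk(const std::vector<std::uint32_t>& colors) {
    std::vector<Link> links(colors.size());
    std::atomic<std::uint32_t> counter{0};  // reset to zero, like the counter buffer
    std::atomic<std::uint32_t> start{0};    // cleared to the sentinel

    for (std::uint32_t color : colors) {
        // 1-based handle: never collides with the 0 sentinel.
        const std::uint32_t handle =
            counter.fetch_add(1, std::memory_order_relaxed) + 1;
        assert(handle <= links.size());
        // Subtract the 1 back out when indexing, so slot 0 is not wasted.
        links[handle - 1].color = color;
        links[handle - 1].next = start.exchange(handle, std::memory_order_relaxed);
    }

    std::vector<std::uint32_t> walked;
    for (std::uint32_t h = start.load(); h != 0; h = links[h - 1].next)
        walked.push_back(links[h - 1].color);
    return walked;
}
```

Without the +1, the first fragment would receive handle 0 and be indistinguishable from an empty list. The walk returns colors in reverse insertion order, since each new node becomes the head.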

Oh, by the way, the color parameter of stroke_abuffer_fragment() should probably be declared in the constant address space, not the device address space.
