
I'm running into a weird issue working with data created by Metal shader code. I want to do additional processing in Swift, and then inject the data back to be rendered by Metal.

  • Metal shader is given a MetalBuffer
  • Metal shader processes vertices one by one
  • Metal shader writes the result to the provided buffer
  • Metal shader executes its completion block for the buffer (the hand-off is sketched just after this list)
  • Swift code receives the buffer and iterates over it, rewriting one struct into another
  • Swift code writes the result to the shared buffer used for rendering
  • Metal shader renders pixels at slightly wrong positions, causing flicker and graphical glitches
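
Roughly how that hand-off is wired, as a simplified sketch (convertParticles() is a stand-in for the conversion code shown further down):

let commandBuffer = commandQueue.makeCommandBuffer()!
// ... encode screenSample here, which fills the ParticleUniforms buffer ...
commandBuffer.addCompletedHandler { _ in
    // CPU side: rewrite ParticleUniforms into DisplayPoint and copy the
    // result into the shared buffer that particleVertex reads when rendering
    convertParticles()
}
commandBuffer.commit()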

All structs and math use simd_float3.

How do I correctly pass struct data from Metal code to be available in Swift and pass it back? What I'm seeing is that my values get corrupted as a result of processing (some kind of floating-point error: pixels get misaligned when drawn using the new struct).

If I assign the value directly from one buffer to the shared buffer, the points are rendered at correct locations, and no flicker happens.

//ShaderTypes.h
#include <simd/simd.h>
struct ParticleUniforms {
    simd_float3 position;
    simd_float3 color;
//    float confidence; // I want to get rid of this value after processing in swift
};

struct DisplayPoint {
    simd_float3 position;
    simd_float3 color;
};


/// Vertex shader that takes in a 2D grid-point and infers its 3D position in world-space, along with RGB and confidence.
/// Updates the passed in particleUniforms buffer, one point at a time
vertex void screenSample(uint vertexID [[vertex_id]],
                         constant PointCloudUniforms &uniforms [[buffer(kPointCloudUniforms)]],
                         device ParticleUniforms *particleUniforms [[buffer(kParticleUniforms)]],
                         constant float2 *cameraSamplePatternBuffer [[buffer(kGridPoints)]],
                         texture2d<float, access::sample> capturedImageTextureY [[texture(kTextureY)]],
                         texture2d<float, access::sample> capturedImageTextureCbCr [[texture(kTextureCbCr)]],
                         texture2d<float, access::sample> depthMap [[texture(kTextureDepth)]],
                         texture2d<unsigned int, access::sample> confidenceMap [[texture(kTextureConfidence)]]) {
    
    //this is what combines and parses the data to create each point on the cloud
    const auto sampledCameraPoint = cameraSamplePatternBuffer[vertexID];
    
    const auto texCoord = sampledCameraPoint / uniforms.cameraResolution;

    // Sample the color depth map to get the depth value
    const auto depth = depthMap.sample(colorSampler, texCoord).r;
    // With a 2D point plus depth, we can now get its 3D position
    const auto position = worldPoint(sampledCameraPoint, depth, uniforms.cameraIntrinsicsInversed, uniforms.localToWorld);
    
    // Sample Y and CbCr textures to get the YCbCr color at the given texture coordinate
    const auto ycbcr = float4(capturedImageTextureY.sample(colorSampler, texCoord).r, capturedImageTextureCbCr.sample(colorSampler, texCoord.xy).rg, 1);
    const auto sampledColor = (yCbCrToRGB * ycbcr).rgb;
    // Sample the confidence map to get the confidence value
    const auto confidence = confidenceMap.sample(colorSampler, texCoord).r;
    
    if (confidence > 1.0) {
        particleUniforms[vertexID].position = position.xyz;
        particleUniforms[vertexID].color = sampledColor;
    }
}

Here's the shader that actually draws the points on the screen:

vertex ParticleVertexOut particleVertex(uint vertexID [[vertex_id]],
                                        constant PointCloudUniforms &uniforms [[buffer(kPointCloudUniforms)]],
                                        constant ParticleUniforms *particleUniforms [[buffer(kParticleUniforms)]]) {
    // get point data
    const auto particleData = particleUniforms[vertexID];
    const auto position = particleData.position;
    
    // animate and project the point
    float4 projectedPosition = uniforms.viewProjectionMatrix * float4(position, 1.0);
    
    projectedPosition /= projectedPosition.w;
    
    ParticleVertexOut out;
    out.position = projectedPosition;
    
    const auto sampledColor = particleData.color;
    
    out.color = float4(sampledColor, 1);
    out.pointSize = 1.0 ;
    
    return out;
}

In Swift (both conversions produce corruption of the position and color):

let particle: ParticleUniforms = particlesBuffer[pointIndex]
let point = DisplayPoint(position: simd_float3(particle.position.x,
                                               particle.position.y,
                                               particle.position.z),
                         color: simd_float3(particle.color.x,
                                            particle.color.y,
                                            particle.color.z))
// -OR -
let point = DisplayPoint(position: particle.position,
                         color: particle.color)
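
For reference, a minimal sketch of the CPU-side rewrite loop using the raw buffer contents; particlesBuffer, displayBuffer and pointCount are placeholder names, not from the original code:

let particles = particlesBuffer.contents()
    .bindMemory(to: ParticleUniforms.self, capacity: pointCount)
let display = displayBuffer.contents()
    .bindMemory(to: DisplayPoint.self, capacity: pointCount)

for i in 0..<pointCount {
    // Copy only the fields DisplayPoint keeps; confidence is dropped here
    display[i] = DisplayPoint(position: particles[i].position,
                              color: particles[i].color)
}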

The image corruption is shown in the screenshot below. When I pass the original struct directly from one shader into the other, all pixels align properly and there are no "holes".

[screenshot of the corruption]

If you email me your email (mine is on my about page), I'll invite you to one or two Metal-specific Slack groups. - David H
Shader sources would really help answer your question. - JustSomeGuy
What does "Metal shader processes vertices one by one" mean? Is it a compute shader or a vertex one? Because with a vertex shader there are no guarantees on the write order. Also, this whole timeline that you described isn't going to work without events. When Metal is executing your completedHandler it's not waiting on the GPU timeline for the callback to finish, so by the time you are writing back to the buffer, the GPU is already executing the next command buffer. - JustSomeGuy
I put the vertex shader code in the question. I don't know what a compute shader is. The sample code from Apple pretends to draw these vertices, but really just uses that shader function to write values to a shared buffer which is then iterated over by another shader. - Alex Stone
I have replaced the dummy vertex function with a kernel function, and it seems to work the same, without having to do a fake rendering pass! - Alex Stone

1 Answer


If I understood correctly what you are trying to do, you basically want to stall the GPU timeline until you've modified an intermediate result from the GPU on the CPU.

Basically, unless there's explicit synchronization, after a command buffer is scheduled the CPU and GPU timelines are completely independent from your program's point of view, which means you can't make any assumptions about what executes first between the CPU and the GPU.

If you try to just do it in the completion handler of your command buffer, this isn't going to work. The GPU will continue executing the next command buffer in the queue, and you are basically going to be racing to change the data on the CPU before the GPU gets to it.

To do what you are trying to do properly, you need to use MTLSharedEvent, and instead of using a completion block you are going to use the event's notifyListener method.

Your new timeline is going to look like this (I'm going to use Objective-C here):

GPU Timeline:

  • Metal shader is given a MetalBuffer
  • Metal shader processes vertices one by one
  • Metal shader writes the result to the provided buffer
  • [commandBuffer encodeSignalEvent:event value:n]
  • [commandBuffer encodeWaitForEvent:event value:n+1]

On CPU:

Subscribe to the event: [event notifyListener:_sharedEventListener atValue:n block:^(id<MTLSharedEvent> event, uint64_t value){ /* your block */ }];

And in your block you are going to do the following:

  • Block receives the buffer and iterates over it, rewriting one struct into another
  • Block code writes the result to the shared buffer used for rendering
  • event.signaledValue = n + 1

n in this case should be a monotonically increasing value, so every time you want to progress forward you bump it up. Since we signal both n and n + 1 here, you would need to bump it up by two per iteration. It all depends on your particular use case; just make sure that you aren't waiting on a value that's lower than the event's current signaledValue.
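
Putting the whole round trip together, here is a rough Swift equivalent (the question's CPU-side code is Swift); device, commandQueue, convertParticles() and the frame bookkeeping are placeholders, and the encoding of the two passes is elided:

// One-time setup. All names here are placeholders.
let event = device.makeSharedEvent()!
let listener = MTLSharedEventListener(dispatchQueue: DispatchQueue(label: "metal.sync"))
var frameValue: UInt64 = 0

// Per frame:
frameValue += 2
let n = frameValue

let commandBuffer = commandQueue.makeCommandBuffer()!
// ... encode the pass that fills the ParticleUniforms buffer ...
commandBuffer.encodeSignalEvent(event, value: n)       // GPU: "the buffer is ready"
commandBuffer.encodeWaitForEvent(event, value: n + 1)  // GPU: wait for the CPU rewrite
// ... encode the render pass that reads the shared DisplayPoint buffer ...

event.notify(listener, atValue: n) { event, _ in
    convertParticles()           // CPU: rewrite ParticleUniforms into DisplayPoint
    event.signaledValue = n + 1  // unblock the GPU
}

commandBuffer.commit()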

And this should work. For more info, refer to these articles from Apple: MTLSharedEvent and Synchronizing Events Between a GPU and the CPU

But you should also note that if you aren't using multiple command queues, your GPU will basically stall and do nothing while it waits for the CPU to finish whatever modifications it's doing. Sometimes you can't avoid it, but it's usually best to try to move all the work you can to the GPU timeline to avoid stalling the GPU. You can try to schedule the work that modifies your vertices on the same queue, or use multiple queues and the techniques from this article: Synchronizing Events Within a Single Device.

Anyway, I hope this helps you solve your issue.