Rendering Terrain Dynamically with Argument Buffers : Understanding why the particle buffer is not overwritten by the GPU inflight

Question

I am looking through an Apple demo project that is associated with the 2017 WWDC video entitled "Introducing Metal 2" where the developers demonstrate the use of argument buffers. The project is linked here on the page titled "Rendering Terrain Dynamically with Argument Buffers" on the Apple developer website. Here, they synchronize resource writes by the CPU to prevent race conditions with a dispatch_semaphore_t, signaling it when the command buffer finishes executing on the GPU and waiting on it if the CPU is writing data several frames ahead of the GPU. This is consistent with what was shown in a previous 2014 WWDC "Working With Metal: Fundamentals".

I noticed that it seems the APPLParticleRenderer is sending data to be written by the GPU in a compute pass before it finishes reading from that same buffer from the fragment shader from a previous render pass. The resource storage mode of the buffer is MTLResourceStorageModePrivate. My question: does Metal automatically synchronize access to private id<MTLBuffer>s accessible only by the GPU? Do render, compute, and blit passes called from new id<MTLCommandEncoder> have access to the buffer only after other passes have written and read from it (exclusive access)? I have seen that there are guaranteed barriers within tile shaders, where tile memory is accessed exclusively by the kernel before subsequent fragment shaders access the memory.

Lastly, in the 2016 WWDC "What's New in Metal, Part 2", the first presenter, Charles Brissart, at 16:44 mentions that fragment and vertex functions reading and writing from the same buffer must be placed into two render command encoders, but for compute kernels one compute command encoder suffices. This is consistent with what is seen within the particle renderer.

And right after commit _frameAllocator->switchToNextBufferInRing(); And spawn in APPLParticleRenderer swaps parameters under each other for its kernel. — Ol Sen
Yes I did see that. I also probably should have specified objective-c++ (I'm a Swift native and haven't worked on much objective-c). However, it appears that the _frameAllocator allocates only _uniforms_gpu and switches the buffer referenced there, writing into it each frame. There is a private MTLBuffer object managed by the particle renderer _particleDataPool (with MTLResourceStorageModePrivate) that is reused each frame which seems to be overwritten each frame while the GPU is still reading from it. — Left as an exercise
@OlSen It is possible I found my answer. According to the Apple documentation, a MTLResource has a default hazard tracking mode (hazardTrackingMode) of MTLHazardTrackingModeTracked (or MTLHazardTrackingMode.tracked in Swift). This means Metal tracks dependencies across commands that modify the resource, as is the case with the particle kernel, and delays execution until prior commands accessing the resource are complete. Hence, a triple buffering system is not needed for a resource accessed exclusively by the GPU (if it can be written to by the CPU, that is different) — Left as an exercise
@OlSen I added more details after further researching Metal synchronization if you're interested — Left as an exercise

Left as an exercise Left as an exercise · Accepted Answer · 2020-11-23T02:00:25

See my comment on the original question for a brief version of this answer.

It turns out that Metal tracks dependencies between commands scheduled to the GPU by default for MTLResource types. The hazardTrackingMode property of a MTLResource is defaulted to MTLHazardTrackingModeTracked (MTLHazardTrackingMode.tracked in Swift) according to the Metal documentation. This means Metal tracks dependencies across commands that modify the resource, as is the case with the particle kernel, and delays execution until prior commands accessing the resource are complete. Therefore, since the _particleDataPool buffer has a storage mode of MTLResourceStorageModePrivate (storageModePrivate in Swift), it can only be written to by the GPU; hence, no CPU/GPU synchronization is necessary with a semaphore for this buffer and thus no multi-buffer system is necessary for the resource. Only when a resource can be written to by the CPU while the GPU is still reading from it do we want multiple buffers so the CPU is not idle.

Note that the default hazard tracking mode for a MTLHeap is MTLHazardTrackingModeUntracked (MTLHazardTrackingMode.untracked in Swift), in which case you are responsible for synchronizing resource writes by the GPU

EDIT

After reading into resource synchronization in Metal, there are some additional points I would like to make that I think further clarify what's going on. Note that the remaining portion is in Swift. To learn more in detail, I recommend reading the "Synchronization" section in the Metal documentation here.

MTLFence

Firstly, a MTLFence is used to synchronize accesses to untracked resources within the execution of a single command buffer. A fence gives you explicit control over when the GPU accesses resources and is necessary when you are working with an untracked resource. Otherwise, Metal will handle this synchronization for you

It is important to note that the automatic management I mention in the answer only occurs within a single command buffer between encoding passes. But this does not mean we need to synchronize across command buffers scheduled in the same command queue since a command buffer is not immediately scheduled for execution. In fact, according to the documentation on the addScheduledHandler(_:) method of the MTLCommandBuffer protocol found here

The device object schedules the command buffer after it identifies any dependencies with work tasks submitted by other command buffers or other APIs in the system.

at which point it would be safe to access these same buffers. Note that within a single render encoding pass, it is important to mention that if a vertex shader writes into a buffer the fragment shader in the same pass reads from, this is undefined. I mentioned this in the original question, the solution being to use two render pass encoders. I have yet to determine why this is not necessary for a compute encoder, but I imagine it has to do with how kernels are executed in comparison to vertex and fragment shaders

MTLEvent

In some cases, however, command buffers in different queues created by the same MTLDevice need access to the same resource or depend on one another in some way. In this case, synchronization is necessary because the separate queues schedule their own command buffers without knowledge of the other, meaning there is potential for the two command buffers to be executing concurrently.

To fix this problem, you use an MTLEvent instance created by the device using makeEvent() and encode event signals at specific points in each buffer.

MTLSharedEvent

In the event (no pun intended) that you have multiple processors (different CPU cores, CPU and GPU, or multi-GPU), resource synchronization is needed. Here, you create a MTLSharedEvent in place of a MTLEvent that can be used to synchronize across devices and processes. It is essentially the same API as that of the MTLEvent, but involves command queues on different devices.

Rendering Terrain Dynamically with Argument Buffers : Understanding why the particle buffer is not overwritten by the GPU inflight

1 Answers