1
votes

I've been trying to implement a Compute Shader based particle system.

I have a compute shader which builds a structured buffer of particles, using a UAV with the D3D11_BUFFER_UAV_FLAG_COUNTER flag.

When I add to this buffer, I check if this particle has any complex behaviours, which I want to filter out and perform in a separate compute shader. As an example, if the particle wants to perform collision detection, I add its index to another structured buffer, also with the D3D11_BUFFER_UAV_FLAG_COUNTER flag.

I then run a second compute shader, which processes all the indices, and applies collision detection to those particles.

However, in the second compute shader, I'd estimate that about 5% of the indices are wrong - they belong to other particles, which don't support collision detection.

Here's the compute shader code that perfroms the list building:

// append to destination buffer
uint dstIndex = g_dstParticles.IncrementCounter();
g_dstParticles[ dstIndex ] = particle;

// add to behaviour lists
if ( params.flags & EMITTER_FLAG_COLLISION )
{
    uint behaviourIndex = g_behaviourCollisionIndices.IncrementCounter();
    g_behaviourCollisionIndices[ behaviourIndex ] = dstIndex;
}

If I split out the "add to behaviour lists" bit into a separate compute shader, and run it after the particle lists are built, everything works perfectly. However I think I shouldn't need to do this - it's a waste of bandwidth going through all the particles again.

I suspect that IncrementCounter is actually not guaranteed to return a unique index into the UAV, and that there is some clever optimisation going on that means the index is only valid inside the compute shader it is used in. And thus my attempt to pass it to the second compute shader is not valid.

Can anyone give any concrete answers to what's going on here? And if there's a way for me to keep the filtering inside the same compute shader as my core update?

Thanks!

1

1 Answers

1
votes

IncrementCounter is an atomic operation and so will (driver/hardware bugs notwithstanding) return a unique value to each thread that calls it.

Have you thought about using Append/Consume buffers for this, as it's what they were designed for? The first pass simply appends the complex collision particles to an AppendStructuredBuffer and the second pass consumes from the same buffer but using a ConsumeStructuredBuffer view instead. The second run of compute will need to use DispatchIndirect so you only run as many thread groups as necessary for the number in the list (something the CPU won't know).

The usual recommendations apply though, have you tried the D3D11 Debug Layer and running it on the reference device to be sure it isn't a driver issue?