
I want to implement a fluid simulation, something like this. The algorithm itself is not important. The important point is that if we were to implement it in a pixel shader, it would have to be done in multiple passes.

The problem with the technique I used is that performance is very bad. I'll give an overview of what's happening, describe the technique used to do the calculation in one pass, and then give the timing information.

Overview:

We have a terrain, we want to rain on it, and we want to watch the water flow. The data lives in 1024x1024 textures: the height of the terrain and the amount of water at each point. This is an iterative simulation: iteration 1 takes the terrain and water textures as input, calculates, and then writes the results back to the terrain and water textures. Iteration 2 then runs and changes the textures a little bit more. After hundreds of iterations we have something like this:

(image: the terrain with water after hundreds of iterations)

In each iteration these stages happen:

  1. Fetch terrain and water height.
  2. Calculate Flow.
  3. Write the Flow value into groupshared memory.
  4. Sync group memory.
  5. Read the Flow values from groupshared memory for the current thread and for the threads to its left, right, top, and bottom.
  6. Calculate the new water value based on the Flow values read in the previous step.
  7. Write results to Terrain and Water textures.

So basically we fetch data, do calculation 1, write its results to shared memory, sync, read from shared memory for the current thread and its neighbors, do calculation 2, and write the results.

This is a pattern that occurs in a very wide range of image-processing problems. The classic solution would be a multi-pass shader, but I did it in a one-pass compute shader to save bandwidth.
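
For reference, here is a minimal sketch of what the classic two-kernel version could look like. The kernel and texture names (CSFlow, CSWater, _flowIntermediateTex) are made up for illustration only; the point is that the flow values go through a full-resolution texture between the two dispatches instead of through groupshared memory, at the cost of an extra pass:

#pragma kernel CSFlow
#pragma kernel CSWater

// Hypothetical intermediate texture carrying the flow between the two passes.
RWTexture2D<float4> _flowIntermediateTex;
RWTexture2D<float> waterField;

[numthreads(16, 16, 1)]
void CSFlow(uint2 id : SV_DispatchThreadID)
{
    // Fetch terrain/water and do "Calculate 1" exactly as in the one-pass
    // shader, then store this pixel's outflow in the intermediate texture.
    float4 watOutflow = 0; // ...Calculate 1 goes here...
    _flowIntermediateTex[id] = watOutflow;
}

[numthreads(16, 16, 1)]
void CSWater(int2 id : SV_DispatchThreadID)
{
    // Read the flow of this pixel and its four neighbours from the texture
    // instead of from groupshared memory, then do "Calculate 2".
    float4 cenFlow = _flowIntermediateTex[id];
    float4 lFlow = _flowIntermediateTex[id + int2(-1, 0)];
    float4 rFlow = _flowIntermediateTex[id + int2(1, 0)];
    float4 tFlow = _flowIntermediateTex[id + int2(0, 1)];
    float4 bFlow = _flowIntermediateTex[id + int2(0, -1)];
    // Simplified Calculate 2 (timestep, damping, evaporation etc. omitted for brevity).
    float watDiff = (lFlow.y + rFlow.x + tFlow.w + bFlow.z) - dot(cenFlow, float4(1, 1, 1, 1));
    waterField[id] = max(waterField[id] + watDiff, 0);
}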

Technique:

I used the technique explained in Practical Rendering and Computation with Direct3D 11, chapter 12. Assume we want each thread group to process 16x16x1 pixels. Because the second calculation needs the neighbours too, we pad by one pixel in each direction, meaning the thread groups are actually 18x18x1. Thanks to this padding, every pixel has valid neighbours available in the second calculation. (With 1024x1024 pixels and a 16x16 payload per group, the dispatch is still 64x64 groups, but each group now runs 18x18 = 324 threads instead of 256.) Here is a picture showing the padding. The yellow threads are the ones whose results we keep, and the red ones are the padding: they are part of the thread group, but we only use them for intermediate processing and don't write their results to the textures. Note that in this picture the padded group is 10x10x1, whereas our actual thread group is 18x18x1.

(image: a thread group with padding; yellow = threads whose results are kept, red = padding threads)

The process runs and returns the correct result. The only problem is performance.

Timing: On a system with a GeForce GT 710 I ran the simulation for 10000 iterations.

  • It takes 60 seconds to run the full and correct simulation.
  • If I don't pad the borders and use 16x16x1 thread groups instead, the time drops to 40 seconds. Obviously the results are wrong.
  • If I don't use groupshared memory and feed the second calculation dummy values, the time drops to 19 seconds. The results are of course wrong.

Questions:

  1. Is this the best technique to solve this problem? If we instead did the calculation in two separate kernels, it seems it would be faster: 2x19 < 60.
  2. Why is groupshared memory so damn slow?

Here is the compute shader code. It is the correct version, the one that takes 60 seconds:

#pragma kernel CSMain


Texture2D<float> _waterAddTex;
Texture2D<float4> _flowTex;

RWTexture2D<float4> _watNormTex;
RWTexture2D<float4> _flowOutTex;

RWTexture2D<float> terrainFieldX;
RWTexture2D<float> terrainFieldY;
RWTexture2D<float> waterField;


SamplerState _LinearClamp;
SamplerState _LinearRepeat;

#define _gpSize 16
#define _padGPSize 18

groupshared float4 f4shared[_padGPSize * _padGPSize];

float _timeStep, _resolution, _groupCount, _pixelMeter, _watAddStrength, watDamping, watOutConstantParam, _evaporation;
int _addWater, _computeWaterNormals;
float2 _rainUV;
bool _usePrevOutflow,_useStava;
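
// NOTE: identity2, identity4 and the sampleUavNei helpers used below come from
// an #include that is not shown here. identity2/identity4 are assumed to be
// all-ones vectors (used with dot() to sum components), and sampleUavNei is
// assumed to fetch the centre value plus its four neighbours.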

float terrHeight(float2 texData) {
    return dot(texData, identity2);
}

[numthreads(_padGPSize, _padGPSize, 1)]
void CSMain(int2 groupID : SV_GroupID, uint2 dispatchIdx : SV_DispatchThreadID, uint2 padThreadID : SV_GroupThreadID)
{

    int2 id = groupID * _gpSize + padThreadID - 1;
    int gsmID = padThreadID.x + _padGPSize * padThreadID.y;
    float2 uv = (id + 0.5) / _resolution;

    bool outOfGroupBound = (padThreadID.x == 0 || padThreadID.y == 0 || padThreadID.x == _padGPSize - 1
        || padThreadID.y == _padGPSize - 1) ? true : false;

    // -------------FETCH-------------

    float2 cenTer, lTer, rTer, tTer, bTer;
    sampleUavNei(terrainFieldX,terrainFieldY, id, cenTer, lTer, rTer, tTer, bTer);

    float cenWat, lWat, rWat, tWat, bWat;
    sampleUavNei(waterField, id, cenWat, lWat, rWat, tWat, bWat);

    // -------------Calculate 1-------------

    float cenTerHei = terrHeight(cenTer);
    float cenTotHei = cenWat + cenTerHei;
    float4 neisTerHei = float4(terrHeight(lTer), terrHeight(rTer), terrHeight(tTer), terrHeight(bTer));
    float4 neisWat = float4(lWat, rWat, tWat, bWat);
    float4 neisTotHei = neisWat + neisTerHei;
    float4 neisTotHeiDiff = cenTotHei - neisTotHei;


    float4 prevOutflow = _usePrevOutflow? _flowTex.SampleLevel(_LinearClamp, uv, 0):float4(0,0,0,0);
    float4 watOutflow;
    float4 flowFac = min(abs(neisTotHeiDiff), (cenWat + neisWat) * 0.5f);
    flowFac = min(1, flowFac);
    watOutflow = max(watDamping * prevOutflow + watOutConstantParam * neisTotHeiDiff * flowFac, 0);
    float outWatFac = cenWat / max(dot(watOutflow, identity4) * _timeStep, 0.001f);
    outWatFac = min(outWatFac, 1);
    watOutflow *= outWatFac;

    // -------------groupshared memory-------------
    f4shared[gsmID] = watOutflow;

    GroupMemoryBarrierWithGroupSync();

    float4 cenFlow = f4shared[gsmID];
    float4 lFlow = f4shared[gsmID - 1];
    float4 rFlow = f4shared[gsmID + 1];
    float4 tFlow = f4shared[gsmID + _padGPSize];
    float4 bFlow = f4shared[gsmID - _padGPSize];

    //float4 cenFlow = 0;
    //float4 lFlow = 0;
    //float4 rFlow = 0; 
    //float4 tFlow = 0;
    //float4 bFlow = 0;

    // -------------Calculate 2-------------

    if (!outOfGroupBound) {
        float watDiff = _timeStep * ((lFlow.y + rFlow.x + tFlow.w + bFlow.z) - dot(cenFlow, identity4));
        cenWat = cenWat + watDiff - _evaporation;
        cenWat = max(cenWat, 0);
    }

    // -------------End of calculation-------------

    //Water Addition
    if (_addWater)
        cenWat += _timeStep * _watAddStrength * _waterAddTex.SampleLevel(_LinearRepeat, uv + _rainUV, 0);

    if (_computeWaterNormals)
        _watNormTex[id] = float4(0, 1, 0, 0);

    // -------------Write results-------------

    if (!outOfGroupBound) {
        _flowOutTex[id] = cenFlow;
        waterField[id] = cenWat;
    }

}

1 Answer


OK, I made a terrible mistake. To test the performance of groupshared memory I did the following:

// -------------groupshared memory-------------
f4shared[gsmID] = watOutflow;

GroupMemoryBarrierWithGroupSync();

float4 cenFlow = f4shared[gsmID];
float4 lFlow = f4shared[gsmID - 1];
float4 rFlow = f4shared[gsmID + 1];
float4 tFlow = f4shared[gsmID + _padGPSize];
float4 bFlow = f4shared[gsmID - _padGPSize];

//float4 cenFlow = 0;
//float4 lFlow = 0;
//float4 rFlow = 0; 
//float4 tFlow = 0;
//float4 bFlow = 0;

As you can see, there are some variables after the sync. I first tested with the top five statements, which initialize the variables to the correct values. Then, to test without groupshared memory, I commented those five statements out and uncommented the five below them, which initialize the variables to zero. This is how it looked:

// -------------groupshared memory-------------
//f4shared[gsmID] = watOutflow;

//GroupMemoryBarrierWithGroupSync();

//float4 cenFlow = f4shared[gsmID];
//float4 lFlow = f4shared[gsmID - 1];
//float4 rFlow = f4shared[gsmID + 1];
//float4 tFlow = f4shared[gsmID + _padGPSize];
//float4 bFlow = f4shared[gsmID - _padGPSize];

float4 cenFlow = 0;
float4 lFlow = 0;
float4 rFlow = 0; 
float4 tFlow = 0;
float4 bFlow = 0;

And then the process got much faster. The thing I didn't notice was that the previous computations were no longer used anywhere, so the compiler optimized them away. I wasn't comparing "with groupshared" against "without groupshared"; I was comparing the full calculation against almost no calculation at all. After implementing more of the algorithm I realized the groupshared memory actually performs well and takes only a minor amount of time compared to reading the textures.
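
For anyone running a similar experiment: if you want to stub out part of a shader for timing, you need to keep the earlier results alive so the compiler cannot dead-code-eliminate them. A minimal sketch of a fairer "no groupshared" test under that assumption (the dependency on watOutflow is only there to keep Calculate 1 alive; it is not part of the real algorithm):

// Fairer "no groupshared memory" test: keep a dependency on watOutflow so
// "Calculate 1" cannot be optimized away, and only stub out the neighbour
// values that would have come from groupshared memory.
float4 cenFlow = watOutflow; // still depends on the Calculate 1 result
float4 lFlow = 0;            // neighbour flows replaced with dummy values
float4 rFlow = 0;
float4 tFlow = 0;
float4 bFlow = 0;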

In the end I did manage to do the algorithm in one pass. It is 2.5 times faster than the original implementation, and the technique described in the question works gracefully in my case.