I want to implement a fluid simulation, something like this. The exact algorithm is not important. The important point is that a pixel-shader implementation would have to be done in multiple passes.
The problem with the technique I used is that performance is very bad. I'll give an overview of what happens, then the technique I used to solve the whole calculation in one pass, and then the timing information.
Overview:
We have a terrain, and we want to rain on it and watch the water flow. The data lives in 1024x1024 textures: the height of the terrain and the amount of water at each point. This is an iterative simulation: iteration 1 reads the terrain and water textures as input, calculates, and then writes the results back to the terrain and water textures. Iteration 2 then runs and changes the textures a little more. After hundreds of iterations we have something like this:
In each iteration these stages happen:
- Fetch the terrain and water heights.
- Calculate the flow.
- Write the flow value into groupshared memory.
- Synchronize the group with a memory barrier.
- Read the flow values from groupshared memory for the current thread and for the threads to its left, right, top, and bottom.
- Calculate the new water value from the flow values read in the previous step.
- Write the results to the terrain and water textures.
So basically we fetch data, do calculation 1, put its results into shared memory, sync, fetch from shared memory for the current thread and its neighbors, do calculation 2, and write the results.
This is a pattern that appears in a very wide range of image-processing problems. The classic solution would be a multi-pass shader, but I did it in a one-pass compute shader to save bandwidth.
Technique:
I used the technique explained in Practical Rendering and Computation with Direct3D 11, chapter 12. Assume we want each thread group to cover 16x16x1 pixels. Because the second calculation also needs neighbors, we pad by one pixel in each direction, which means the thread groups are actually 18x18x1 threads. Thanks to this padding, every interior thread has valid neighbors for the second calculation. Here is a picture showing the padding. The yellow threads are the ones whose results we need; the red ones are the padding. They are part of the thread group, but we only use them for intermediate processing and never save their results to the textures. Please note that in this picture the padded group is 10x10x1, while our actual thread group is 18x18x1.
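To make the index math concrete, here is a minimal sketch of the mapping (the kernel name is made up for this sketch; the dispatch count follows from each group producing a 16x16 interior tile of the 1024x1024 textures):

```hlsl
// Group sizes: 16x16 pixels of output per group, 18x18 threads per group.
#define _gpSize 16
#define _padGPSize 18

// CPU side (sketch): the number of groups is based on the *interior* size,
// so for 1024x1024 textures it is Dispatch(1024 / 16, 1024 / 16, 1).

[numthreads(_padGPSize, _padGPSize, 1)]
void CSMainSketch(int2 groupID : SV_GroupID, uint2 padThreadID : SV_GroupThreadID)
{
    // Shift by -1 so padded thread (1,1) lands on the group's first interior
    // pixel; the border threads land on pixels owned by neighboring groups.
    int2 id = groupID * _gpSize + padThreadID - 1;

    // Cost of the padding: each group runs 18*18 = 324 threads but keeps only
    // 16*16 = 256 results (~27% redundant math), and 324 threads round up to
    // 11 warps of 32 versus 8 warps for an unpadded 16x16 group.
}
```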
The process runs and produces correct results. The only problem is performance.
Timing: on a system with a GeForce GT 710 I run the simulation for 10,000 iterations.
- The full, correct simulation takes 60 seconds.
- If I don't pad the borders and use 16x16x1 thread groups, the time drops to 40 seconds. Obviously the results are wrong.
- If I don't use groupshared memory and feed the second calculation dummy values instead, the time drops to 19 seconds. The results are, of course, also wrong.
Questions:
- Is this the best technique to solve this problem? If we instead did the two calculations in two separate kernels, it should be faster, since 2 x 19 = 38 < 60.
- Why is groupshared memory so damn slow?
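For reference, the two-kernel alternative from the first question would look roughly like this. This is only a sketch, not working code: the kernel names are made up, "Calculate 1" and "Calculate 2" stand for the same code as in the full shader below, and the flow texture (ping-ponged between the two dispatches) replaces the groupshared hand-off:

```hlsl
// Pass 1: run Calculate 1 for every pixel and store the outflow in a texture.
#pragma kernel CSFlow
[numthreads(16, 16, 1)]
void CSFlow(uint2 id : SV_DispatchThreadID)
{
    // ...fetch + Calculate 1 exactly as in the one-pass shader...
    _flowOutTex[id] = watOutflow;
}

// Pass 2: read this pixel's flow and its four neighbors', then run Calculate 2.
#pragma kernel CSWater
[numthreads(16, 16, 1)]
void CSWater(uint2 id : SV_DispatchThreadID)
{
    // Out-of-bounds reads on a typed texture return 0 in D3D, which
    // conveniently gives zero inflow at the grid borders.
    float4 cenFlow = _flowTex[id];
    float4 lFlow = _flowTex[id + int2(-1, 0)];
    float4 rFlow = _flowTex[id + int2( 1, 0)];
    float4 tFlow = _flowTex[id + int2( 0, 1)];
    float4 bFlow = _flowTex[id + int2( 0, -1)];
    // ...Calculate 2 + water addition + writes exactly as in the one-pass shader...
}
```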
Here is the compute shader code. It is the correct version that takes 60 seconds:
#pragma kernel CSMain
Texture2D<float> _waterAddTex;
Texture2D<float4> _flowTex;
RWTexture2D<float4> _watNormTex;
RWTexture2D<float4> _flowOutTex;
RWTexture2D<float> terrainFieldX;
RWTexture2D<float> terrainFieldY;
RWTexture2D<float> waterField;
SamplerState _LinearClamp;
SamplerState _LinearRepeat;
#define _gpSize 16
#define _padGPSize 18
groupshared float4 f4shared[_padGPSize * _padGPSize];
float _timeStep, _resolution, _groupCount, _pixelMeter, _watAddStrength, watDamping, watOutConstantParam, _evaporation;
int _addWater, _computeWaterNormals;
float2 _rainUV;
bool _usePrevOutflow,_useStava;
// Note: identity2 / identity4 and sampleUavNei() come from a shared include
// that is not shown here; identityN is assumed to be a vector of ones, so
// dot(x, identityN) sums the components of x.
float terrHeight(float2 texData) {
    return dot(texData, identity2);
}

[numthreads(_padGPSize, _padGPSize, 1)]
void CSMain(int2 groupID : SV_GroupID, uint2 dispatchIdx : SV_DispatchThreadID, uint2 padThreadID : SV_GroupThreadID)
{
    // Map the padded thread index back to the pixel this thread owns.
    int2 id = groupID * _gpSize + padThreadID - 1;
    int gsmID = padThreadID.x + _padGPSize * padThreadID.y;
    float2 uv = (id + 0.5) / _resolution;
    bool outOfGroupBound = padThreadID.x == 0 || padThreadID.y == 0
        || padThreadID.x == _padGPSize - 1 || padThreadID.y == _padGPSize - 1;

    // -------------FETCH-------------
    float2 cenTer, lTer, rTer, tTer, bTer;
    sampleUavNei(terrainFieldX, terrainFieldY, id, cenTer, lTer, rTer, tTer, bTer);
    float cenWat, lWat, rWat, tWat, bWat;
    sampleUavNei(waterField, id, cenWat, lWat, rWat, tWat, bWat);

    // -------------Calculate 1-------------
    float cenTerHei = terrHeight(cenTer);
    float cenTotHei = cenWat + cenTerHei;
    float4 neisTerHei = float4(terrHeight(lTer), terrHeight(rTer), terrHeight(tTer), terrHeight(bTer));
    float4 neisWat = float4(lWat, rWat, tWat, bWat);
    float4 neisTotHei = neisWat + neisTerHei;
    float4 neisTotHeiDiff = cenTotHei - neisTotHei;
    float4 prevOutflow = _usePrevOutflow ? _flowTex.SampleLevel(_LinearClamp, uv, 0) : float4(0, 0, 0, 0);
    float4 flowFac = min(abs(neisTotHeiDiff), (cenWat + neisWat) * 0.5f);
    flowFac = min(1, flowFac);
    float4 watOutflow = max(watDamping * prevOutflow + watOutConstantParam * neisTotHeiDiff * flowFac, 0);
    float outWatFac = cenWat / max(dot(watOutflow, identity4) * _timeStep, 0.001f);
    outWatFac = min(outWatFac, 1);
    watOutflow *= outWatFac;

    // -------------groupshared memory-------------
    f4shared[gsmID] = watOutflow;
    GroupMemoryBarrierWithGroupSync();
    float4 cenFlow = f4shared[gsmID];

    // -------------Calculate 2-------------
    // Only interior threads read their neighbors; the border (padding)
    // threads would otherwise index outside the shared array.
    if (!outOfGroupBound) {
        float4 lFlow = f4shared[gsmID - 1];
        float4 rFlow = f4shared[gsmID + 1];
        float4 tFlow = f4shared[gsmID + _padGPSize];
        float4 bFlow = f4shared[gsmID - _padGPSize];
        // For the 19-second timing test mentioned above, these four reads
        // were replaced with dummy values (lFlow = 0, etc.).
        float watDiff = _timeStep * ((lFlow.y + rFlow.x + tFlow.w + bFlow.z) - dot(cenFlow, identity4));
        cenWat = cenWat + watDiff - _evaporation;
        cenWat = max(cenWat, 0);
    }

    // -------------End of calculation-------------
    // Water addition
    if (_addWater)
        cenWat += _timeStep * _watAddStrength * _waterAddTex.SampleLevel(_LinearRepeat, uv + _rainUV, 0);
    if (_computeWaterNormals)
        _watNormTex[id] = float4(0, 1, 0, 0);

    // -------------Write results-------------
    if (!outOfGroupBound) {
        _flowOutTex[id] = cenFlow;
        waterField[id] = cenWat;
    }
}

