I am just beginning to learn DirectX programming, using F# and SharpDX as .NET wrapper. As a test case I render the Mandelbrot set. The computation is done using 2 compute shaders.
The first shader computes the depth for each pixel (function "CalcMandel"), the results are stored in a RWStructuredBuffer. This computation requires a lot of single or double multiplications, yet it’s blindingly fast on my GPU (AMD 7790). “CalcMandel” has the attribute
[numthreads(16, 16, 1)]
and is dispatched via
context.Dispatch (imageWidth / 16, imageHeight / 16, 1)
NO PROBLEM here – a 1000 x 800 pixel image of the “core” Mandelbrot set runs with over 1000 fps (using single precision on the GPU).
The second shader has almost nothing do: it computes the min, max, and average of the previous computation (function “CalcMinMax”). “CalcMinMax” has the attribute
[numthreads(1, 1, 1)]
and is called via
context.Dispatch (1,1,1)
For then given image size a single GPU thread has to traverse the buffer over 800.000 integers to calculate the min, max and average. I use a single thread because I have no idea how to implement this calculation in a parallel fashion.
THE PROBLEM: “CalcMinMax” is terribly slow: the frame rate drops from over 1000 to 5 fps!
MY QUESTIONS: What’s wrong here? Am I using the wrong setup / parameters (numthreads)? How can I speed up the min-max calculation?
MY THOUGHTS: My first assumption was that accessing the RWBuffer might be slow – this is not the case. When I replaced the buffer access by a constant, the frame rate did not increase.
My GPU has appr. 900 shader cores and uses thousands of threads to compute the Mandelbrot set, while “CalcMinMax” uses only one thread. Yet, I still don’t understand why things get so slow.
I would appreciate any advice!
================================================
// HLSL CONTENT (Mandelbrot set calculation is omitted):
cbuffer cbCSMandel : register( b0 )
{
double a0, b0, da, db;
double ja0, jb0;
int max_iterations;
bool julia; int cycle;
int width; int height;
double colorFactor;
int algoIndex;
int step;
};
struct statistics
{
int minDepth;
int axDepth;
float avgDepth;
int loops;
};
RWStructuredBuffer<float4> colorOutputTable : register (u0);
StructuredBuffer<float4> output2 : register (t0);
RWStructuredBuffer<int> counterTable : register (u1);
RWStructuredBuffer<float4> colorTable : register (u2);
RWStructuredBuffer<statistics>statsTable : register (u3);
// Mandelbrot calculations….
// Results are written to counterTable and colorOutputTable
// I limit the samples to 10000 pixels because calcMinMax is too slow
#define NUM_SAMPLES 10000;
void calcMinMax()
{
int minDepth = 64000;
int maxDepth = 0;
int len = width * height;
int crit = len / NUM_SAMPLES;
int steps = max (crit, 1);
int index = 0;
int sumCount = 0;
float sum = 0.0;
while (index < len)
{
int cnt = counterTable[index];
minDepth = cnt < minDepth & cnt > 0 ? cnt : minDepth;
maxDepth = cnt > maxDepth ? cnt : maxDepth;
sum += cnt > 0 ? cnt : 0.0f;
sumCount += cnt > 0 ? 1 : 0;
index += steps;
}
statsTable[0].minDepth = minDepth;
statsTable[0].maxDepth = maxDepth;
statsTable[0].avgDepth = sum / sumCount;
statsTable[0].loops += 1;
}
[numthreads(1, 1, 1)]
void CalcMinMax ( uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID, uint3 GTid : SV_GroupThreadID, uint GI : SV_GroupIndex )
{
switch (GI) // this switch is used to verify GI number (always 0)
{
case 0: calcMinMax();
break;
default: ;
break;
}
}
// ******************* F# program (min-max part) *************
shader setup:
use minMaxShaderCode = ShaderBytecode.CompileFromFile(shaderPath, "CalcMinMax", "cs_5_0")
minMaxShader <- new ComputeShader(device, minMaxShaderCode.Bytecode.Data )
shader usage:
// ---------- CONNECT MinMap Shader
context.ComputeShader.Set(minMaxShader)
context.ComputeShader.SetUnorderedAccessView(STATS_SLOT, statsBuffer.BufferView)
context.ComputeShader.SetConstantBuffer(CONSTANT_SLOT, constantBuffer)
context.ComputeShader.SetUnorderedAccessView (COUNTER_SLOT, dataBuffer.BufferView)
context.Dispatch (1,1,1)
// ---------- DISCONNECT MinMap Shader
context.ComputeShader.SetConstantBuffer(CONSTANT_SLOT, null)
context.ComputeShader.SetUnorderedAccessView (STATS_SLOT, null)
context.ComputeShader.SetUnorderedAccessView (COUNTER_SLOT, null)
context.ComputeShader.Set (null)
reading statistics:
context.CopyResource(statsBuffer.DataBuffer, statsBuffer.StagingBuffer)
let boxer, stream = context.MapSubresource(statsBuffer.StagingBuffer, MapMode.Read, MapFlags.None)
calcStatistics <- stream.Read<CalcStatistics>()
context.UnmapSubresource(statsBuffer.DataBuffer, 0)