DirectX 11, Combining pixel shaders to prevent bottlenecks

Question

I'm trying to implement one complex algorithm using GPU. The only problem is HW limitations and maximum available feature level is 9_3.

Algorithm is basically "stereo matching"-like algorithm for two images. Because of mentioned limitations all calculations has to be performed in Vertex/Pixel shaders only (there is no computation API available). Vertex shaders are rather useless here so I considered them as pass-through vertex shaders.

Let me shortly describe the algorithm:

Take two images and calculate cost volume maps (basically conterting RGB to Grayscale -> translate right image by D and subtract it from the left image). This step is repeated around 20 times for different D which generates Texture3D.

Problem here: I cannot simply create one Pixel Shader which calculates those 20 repetitions in one go because of size limitation of Pixel Shader (max. 512 arithmetics), so I'm forced to call Draw() in a loop in C++ which unnecessary involves CPU while all operations are done on the same two images - it seems to me like I have one bottleneck here. I know that there are multiple render targets but: there are max. 8 targets (I need 20+), if I want to generate 8 results in one pixel shader I exceed it's size limit (512 arithmetic for my HW).
Then I need to calculate for each of calculated textures box filter with windows where r > 9.

Another problem here: Because window is so big I need to split box filtering into two Pixel Shaders (vertical and horizontal direction separately) because loops unrolling stage results with very long code. Manual implementation of those loops won't help cuz still it would create to big pixel shader. So another bottleneck here - CPU needs to be involved to pass results from temp texture (result of V pass) to the second pass (H pass).
Then in next step some arithmetic operations are applied for each pair of results from 1st step and 2nd step.

I haven't reach yet here with my development so no idea what kind of bottlenecks are waiting for me here.
Then minimal D (value of parameter from 1st step) is taken for each pixel based on pixel value from step 3.

... same as in step 3.

Here basically is VERY simple graph showing my current implementation (excluding steps 3 and 4).

Red dots/circles/whatever are temporary buffers (textures) where partial results are stored and at every red dot CPU is getting involved.

Question 1: Isn't it possible somehow to let GPU know how to perform each branch form up to the bottom without involving CPU and leading to bottleneck? I.e. to program sequence of graphics pipelines in one go and then let the GPU do it's job.

One additional question about render-to-texture thing: Does all textures resides in GPU memory all the time even between Draw() method calls and Pixel/Vertex shaders switching? Or there is any transfer from GPU to CPU happening... Cuz this may be another issue here which leads to bottleneck.

Any help would be appreciated!

Thank you in advance.

Best regards, Lukasz

Ivan Aksamentov - Drop Ivan Aksamentov - Drop · Accepted Answer · 2013-09-23T15:36:43

Writing computational algorithms in pixel shaders can be very difficult. Writing such algorithms for 9_3 target can be impossible. Too much restrictions. But, well, I think I know how to workaround your problems.

1. Shader repetition

First of all, it is unclear, what do you call "bottleneck" here. Yes, theoretically, draw calls in for loop is a performance loss. But does it bottleneck? Does your application really looses performance here? How much? Only profilers (CPU and GPU) can answer. But to run it, you must first complete your algorithm (stages 3 and 4). So, I'd better stick with current solution, and started to implement whole algorithm, then profile and than fix performance issues.

But, if you feel ready to tweaks... Common "repetition" technology is instancing. You can create one more vertex buffer (called instance buffer), which will contains parameters not for each vertex, but for one draw instance. Then you do all the stuff with one DrawInstanced() call.

For you first stage, instance buffer can contain your D value and index of target Texture3D layer. You can pass-through them from vertex shader.

As always, you have a tradeof here: simplicity of code to (probably) performance.

2. Multi-pass rendering

CPU needs to be involved to pass results from temp texture (result of V pass) to the second pass (H pass)

Typically, you do chaining like this, so no CPU involved:

// Pass 1: from pTexture0 to pTexture1
// ...set up pipeline state for Pass1 here...
pContext->PSSetShaderResources(slot, 1, pTexture0); // source
pContext->OMSetRenderTargets(1, pTexture1, 0);      // target
pContext->Draw(...);

// Pass 2: from pTexture1 to pTexture2
// ...set up pipeline state for Pass1 here...
pContext->PSSetShaderResources(slot, 1, pTexture1); // previous target is now source
pContext->OMSetRenderTargets(1, pTexture2, 0);
pContext->Draw(...);
// Pass 3: ...

Note, that pTexture1 must have both D3D11_BIND_SHADER_RESOURCE and D3D11_BIND_RENDER_TARGET flags. You can have multiple input textures and multiple render targets. Just make sure, that every next pass knows what previous pass outputs. And if previous pass uses more resources than current, don't forget to unbind unneeded, to prevent hard-to-find errors:

pContext->PSSetShaderResources(2, 1, 0);
pContext->PSSetShaderResources(3, 1, 0);
pContext->PSSetShaderResources(4, 1, 0);
// Only 0 and 1 texture slots will be used

3. Resource data location

Does all textures resides in GPU memory all the time even between Draw() method calls and Pixel/Vertex shaders switching?

We can never know that. Driver chooses appropriate location for resources. But if you have resources created with DEFAULT usage and 0 CPU access flag, you can be almost sure it will always be in video memory.

Hope it helps. Happy coding!

DirectX 11, Combining pixel shaders to prevent bottlenecks

1 Answers