Intro

I am trying to render squares in DirectX 11 as efficiently as possible. Each square has a color (float3) and a position (float3). A typical model contains about 5 million squares.

I tried 3 ways:

  1. Render raw data
  2. Use geometry shader
  3. Use instanced rendering

Raw data means that each square is represented by 4 vertices in the vertex buffer and two triangles (6 indices) in the index buffer.

Geometry shader and instanced rendering mean that each square has just one vertex in the vertex buffer.
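
To make the difference between the layouts concrete, here is a minimal CPU-side sketch of how one square expands into raw data. The names (`Square`, `RawVertex`, `appendSquare`), the 32-bit index type, and the corner order and winding are illustrative assumptions, not taken from the original code.

```cpp
#include <array>
#include <cstdint>
#include <vector>

// One square as stored by the geometry-shader and instanced variants.
struct Square {
    std::array<float, 3> position;
    std::array<float, 3> color;
};

// One vertex of the raw layout (position, color, shift).
struct RawVertex {
    std::array<float, 3> position;
    std::array<float, 3> color;
    std::array<float, 2> shift;  // xy deviation from the square center
};

// Appends 4 vertices and 6 indices (two triangles) for one square.
// Corner order and triangle winding are assumptions.
void appendSquare(const Square& sq,
                  std::vector<RawVertex>& vertices,
                  std::vector<std::uint32_t>& indices) {
    static const float corners[4][2] = {
        {-1.0f, -1.0f}, {-1.0f, 1.0f}, {1.0f, -1.0f}, {1.0f, 1.0f}};
    const std::uint32_t base = static_cast<std::uint32_t>(vertices.size());
    for (const auto& c : corners) {
        RawVertex v{sq.position, sq.color, {c[0], c[1]}};
        vertices.push_back(v);
    }
    static const std::uint32_t tri[6] = {0, 1, 2, 2, 1, 3};
    for (std::uint32_t i : tri) {
        indices.push_back(base + i);
    }
}
```

Position and color are repeated four times per square in this layout, which is the duplication the one-vertex-per-square variants avoid.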

My results (on nvidia GTX960M) for 5M squares are:

  • Geometry shader 22 FPS
  • Instanced rendering 30 FPS
  • Raw data rendering 41 FPS

I expected that the geometry shader would not be the most efficient method. On the other hand, I am surprised that instanced rendering is slower than raw data. The computation in the vertex shader is exactly the same: a multiplication by the transform matrix stored in a constant buffer, plus the addition of the Shift variable.

Raw data input

struct VSInput {
    float3 Position : POSITION0;
    float3 Color : COLOR0;
    float2 Shift : TEXCOORD0; // xy deviation from the square center
};

Instanced rendering input

struct VSInputPerVertex {
    float2 Shift : TEXCOORD0;
};

struct VSInputPerInstance {
    float3 Position : POSITION0;
    float3 Color : COLOR0;
};

Note

For bigger models (20M squares), instanced rendering is more efficient (presumably because of the reduced memory traffic).
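
As a rough illustration of the memory-traffic argument, the buffer sizes can be estimated from the two input layouts. This is only a sketch: it assumes tightly packed 4-byte floats, 32-bit indices, and 4 shared shift vertices for the instanced variant. The function names are made up for the example, and the real traffic also depends on caching in the GPU.

```cpp
#include <cstdint>

constexpr std::uint64_t kFloatBytes = 4;
// Raw layout: float3 position + float3 color + float2 shift per vertex.
constexpr std::uint64_t kRawVertexBytes = (3 + 3 + 2) * kFloatBytes;  // 32
// Assumption: 32-bit indices (20M vertices do not fit in 16 bits).
constexpr std::uint64_t kIndexBytes = 4;
// Instanced layout: float3 position + float3 color per square.
constexpr std::uint64_t kInstanceBytes = (3 + 3) * kFloatBytes;       // 24

// Vertex buffer plus index buffer size for the raw layout.
constexpr std::uint64_t rawBytes(std::uint64_t squares) {
    return squares * (4 * kRawVertexBytes + 6 * kIndexBytes);  // 152 per square
}

// Per-instance buffer plus the 4 shared float2 shift vertices.
constexpr std::uint64_t instancedBytes(std::uint64_t squares) {
    return squares * kInstanceBytes + 4 * 2 * kFloatBytes;
}
```

Under these assumptions, 5M squares take about 760 MB as raw data versus about 120 MB instanced, and 20M squares take about 3.04 GB versus about 480 MB.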

Question

Why is instanced rendering slower than raw data rendering in the case of 5M squares? Is there another efficient way to accomplish this rendering task? Am I missing something?

Edit

StructuredBuffer method

One possible solution is to use a StructuredBuffer, as @galop1n suggested (see his answer for details).

My results (on nvidia GTX960M) for 5M squares

  • StructuredBuffer 48 FPS

Observations

  • Sometimes I observed the StructuredBuffer method oscillating between 30 FPS and 55 FPS (each figure accumulated over 100 frames). It seems to be a little unstable. The median is 48 FPS. I did not observe this with the previous methods.
  • Consider the balance between the number of draw calls and the StructuredBuffer sizes. For smaller models, I got the fastest behavior with buffers of 1K - 4K points. When I rendered the 5M square model with such buffers, I had a large number of draw calls and it was not efficient (30 FPS). The best behavior I observed with 5M squares was with 16K points per buffer; 32K and 8K points per buffer seemed to be slower.
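
The trade-off between buffer size and draw-call count can be sketched as follows. This is a minimal example, assuming one draw call per buffer and "K" meaning 1024. `Batch` and `splitIntoBatches` are hypothetical names, not part of the original code.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A contiguous range of squares drawn with one buffer and one draw call.
struct Batch {
    std::size_t firstSquare;
    std::size_t squareCount;
};

// Splits totalSquares into batches of at most squaresPerBuffer squares.
std::vector<Batch> splitIntoBatches(std::size_t totalSquares,
                                    std::size_t squaresPerBuffer) {
    std::vector<Batch> batches;
    if (squaresPerBuffer == 0) {
        return batches;  // nothing sensible to do with empty buffers
    }
    for (std::size_t first = 0; first < totalSquares; first += squaresPerBuffer) {
        Batch b{first, std::min(squaresPerBuffer, totalSquares - first)};
        batches.push_back(b);
    }
    return batches;
}
```

With these assumptions, 5M squares need 4883 draw calls at 1K points per buffer, but only 306 at 16K points per buffer.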
Only 55 FPS for 5 million quads looks bad to me; of course, it depends on the complexity of the pixel shader and on the fill rate at your resolution. What is your peak rate with a 1x1 viewport, or if you push all the quads offscreen? Because your SRV is probably dynamic, it may not be in the best memory type; be careful to write your data in order, without any read-back. Is it also possible that you are CPU limited? The batch size should not be an issue either. - galop1n

1 Answer

A small vertex count per instance is usually a good way to underuse the hardware. I suggest the following variant; it should perform well on every vendor's hardware.

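// Assumes `context` is an ID3D11DeviceContext* and `quadData` is an
// ID3D11ShaderResourceView* over the quad buffer (names are illustrative).
context->IASetInputLayout(nullptr); // no vertex buffer: data comes from the SRV
context->VSSetShaderResources(0, 1, &quadData);
context->IASetPrimitiveTopology(D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
context->Draw(6 * quadCount, 0);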

In the vertex shader, you have

struct Quad {
    float3 pos;
    float3 color;
};
StructuredBuffer<Quad> quads : register(t0);

And to rebuild your quads in the vertex shader:

// shift for each vertex
static const float2 shifts[6] = { float2(-1,-1), ..., float2(1,1) };
void main( uint vtx : SV_VertexID, out YourStuff yourStuff) {
    Quad quad = quads[vtx/6];
    float2 offs = shifts[vtx%6];
}
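
The index arithmetic above can be checked on the CPU. The sketch below mirrors `vtx/6` and `vtx%6` in C++. The middle entries of the `shifts` table are elided in the shader, so the six corners used here, and their winding, are only one plausible choice. `Corner`, `VertexSource`, and `decodeVertexId` are names invented for the example.

```cpp
#include <cstdint>

struct Corner {
    float x;
    float y;
};

// Hypothetical corner table for two triangles per quad. Only the first and
// last entries match the shader snippet; the rest are assumptions.
static const Corner kShifts[6] = {
    {-1.0f, -1.0f}, {-1.0f, 1.0f}, {1.0f, -1.0f},
    {1.0f, -1.0f},  {-1.0f, 1.0f}, {1.0f, 1.0f},
};

struct VertexSource {
    std::uint32_t quadIndex;  // which element of the quad buffer to read
    Corner shift;             // which corner of that quad this vertex is
};

// Mirrors the shader: quads[vtx / 6] and shifts[vtx % 6].
inline VertexSource decodeVertexId(std::uint32_t vertexId) {
    VertexSource src{vertexId / 6, kShifts[vertexId % 6]};
    return src;
}
```

Vertex IDs 0 to 5 all read quad 0, IDs 6 to 11 read quad 1, and so on, so `Draw(6 * quadCount, 0)` visits every quad exactly six times.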

Then rebuild the vertex and transform it as usual. Note that because you bypass the input assembler stage, if you want to send colors as rgba8 you need to use a uint and unpack it manually. Bandwidth usage will be lower, which helps when you have millions of quads to draw.
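
For the rgba8 case, the manual packing and unpacking could look roughly like this. It is a CPU-side sketch in C++. The byte order (red in the lowest byte) and the names `packRgba8`, `unpackRgba8`, and `Float4` are assumptions; the same shifts and masks would be written in HLSL inside the vertex shader.

```cpp
#include <cstdint>

// Packs four 8-bit channels into one 32-bit value, red in the lowest byte.
// The byte order is an assumption; it only has to match the unpacking side.
inline std::uint32_t packRgba8(std::uint8_t r, std::uint8_t g,
                               std::uint8_t b, std::uint8_t a) {
    return static_cast<std::uint32_t>(r) |
           (static_cast<std::uint32_t>(g) << 8) |
           (static_cast<std::uint32_t>(b) << 16) |
           (static_cast<std::uint32_t>(a) << 24);
}

struct Float4 {
    float r, g, b, a;
};

// Reverses packRgba8 and normalizes each channel to the range [0, 1].
inline Float4 unpackRgba8(std::uint32_t packed) {
    Float4 c;
    c.r = static_cast<float>(packed & 0xFFu) / 255.0f;
    c.g = static_cast<float>((packed >> 8) & 0xFFu) / 255.0f;
    c.b = static_cast<float>((packed >> 16) & 0xFFu) / 255.0f;
    c.a = static_cast<float>((packed >> 24) & 0xFFu) / 255.0f;
    return c;
}
```

Storing the color this way takes 4 bytes per quad instead of the 12 bytes of a float3.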