3
votes

I'm trying to render a large number of very small 2D quads as fast as possible on an Apple A7 GPU using the Metal API. Researching that GPU's triangle throughput numbers, e.g. here, and from Apple quoting >1M triangles on screen during their keynote demo, I'd expect to be able to render something like 500,000 such quads per frame at 60fps. Perhaps a bit less, given that all of them are visible (on screen, not hidden by z-buffer) and tiny (tricky for the rasterizer), so this likely isn't a use case that the GPU is super well optimized for. And perhaps that Apple demo was running at 30fps, so let's say ~200,000 should be doable. Certainly 100,000 ... right?

However, in my test app the max is just ~20,000 -- any more and the framerate drops below 60 on an iPad Air. With 100,000 quads it runs at 14 fps, i.e. a throughput of 2.8M triangles/sec (compare that to the 68.1M onscreen triangles quoted in the AnandTech article!).

Even if I make the quads a single pixel in size, with a trivial fragment shader, performance doesn't improve. So we can assume that this is vertex bound, and the GPU report in Xcode agrees ("Tiler" is at 100%). The vertex shader is trivial as well, doing nothing but a little scaling and translation math, so I'm assuming the bottleneck is some fixed-function stage...?

Just for some more background info, I'm rendering all the geometry using a single instanced draw call, with one quad per instance, i.e. 4 vertices per instance. The quads' positions are read from a separate buffer that's indexed by instance id in the vertex shader. I've tried a few other methods as well (non-instanced with all vertices pre-transformed, instanced+indexed, etc.), but that didn't help. There are no complex vertex attributes, buffer/surface formats, or anything else I can think of that seems likely to hit a slow path in the driver/GPU (though I can't be sure of course). Blending is off. Pretty much everything else is in the default state (viewport, scissor, ztest, culling, etc.).
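For concreteness, the CPU-side data layout I'm describing can be sketched in Swift like this. The names (`Float2`, `QuadState`, the grid positions) are illustrative, not my actual app code; `Float2` stands in for simd's `SIMD2<Float>` so the snippet is self-contained:

```swift
// CPU-side mirror of the data the vertex shader reads.
// (Float2 stands in for the shader's float2; on Apple platforms
// you would normally use SIMD2<Float> from the simd module.)
struct Float2 { var x: Float; var y: Float }

// One QuadState per instance, indexed by [[ instance_id ]] in the shader.
struct QuadState { var position: Float2 }

// Four corner vertices of a unit quad, shared by all instances
// (drawn as a triangle strip, 4 vertices per instance).
let unitQuad: [Float2] = [
    Float2(x: -1, y: -1), Float2(x: 1, y: -1),
    Float2(x: -1, y:  1), Float2(x: 1, y:  1),
]

// Per-instance positions, laid out on a grid for the test scene.
let quadCount = 100_000
let quads = (0..<quadCount).map { i in
    QuadState(position: Float2(x: Float(i % 400), y: Float(i / 400)))
}

// Byte lengths passed to device.makeBuffer(bytes:length:options:).
let vertexLength = unitQuad.count * MemoryLayout<Float2>.stride
let quadLength   = quads.count * MemoryLayout<QuadState>.stride
print(vertexLength, quadLength)
```

The draw itself is then a single `drawPrimitives(type: .triangleStrip, vertexStart: 0, vertexCount: 4, instanceCount: quadCount)`.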

The application is written in Swift, though hopefully that doesn't matter ;)

What I'm trying to understand is whether the performance I'm seeing is expected when rendering quads like this (as opposed to a "proper" 3D scene), or whether some more advanced techniques are needed to get anywhere close to the advertised triangle throughputs. What do people think is likely the limiting bottleneck here?

Also, if anyone knows any reason why this might be faster in OpenGL than in Metal (I haven't tried, and can't think of any reason), then I'd love to hear it as well.

Thanks

Edit: adding shader code.

vertex float4 vertex_shader(
        const constant float2* vertex_array [[ buffer(0) ]],
        const device QuadState* quads [[ buffer(1) ]],
        const constant Parms& parms [[ buffer(2) ]],
        unsigned int vid [[ vertex_id ]],
        unsigned int iid [[ instance_id ]] )
{
    // corner of the shared unit quad, scaled down
    float2 v = vertex_array[vid]*0.5f;

    // per-instance translation
    v += quads[iid].position;

    // ortho cam and projection transform
    v += parms.cam.position;
    v *= parms.cam.zoom * parms.proj.scaling;

    return float4(v, 0, 1.0);
}


fragment half4 fragment_shader()
{
    return half4(0.773,0.439,0.278,0.4);
}
Can you show us your vertex layout/descriptor and your shader code? In a sample app I have here, I can hit 150ktris per frame on an iPad mini 2, and >300ktris/frame on an iPhone 6. My triangles have an average coverage of 2 pixels apiece. — warrenm
Sure thing, I added the shader code above. I don't explicitly set a vertex layout. I've also noticed that it matters a lot for the Tiler how much of the screen is covered by quads (I expected this to matter for the fragment stage, but was surprised to see it influence the vertex stage so much... guess it's a tile caching effect). That is, concentrating all the quads in a small area of the screen rather than uniformly distributing them all over the place improves perf a lot, and then I can hit >100k triangles as well. Perhaps that's how they get to >1M: small objects with very high tri count. — lespalt
Yeah, the tiler has a lot to do with it. Most of these tiny triangles will only hit a single tile, and the fewer tiles that have to be moved onto the GPU, the less the tiler overhead will be. FWIW, I don't see anything flagrantly wrong with your shader. — warrenm

1 Answer

1
votes

Without seeing your Swift/Objective-C code I can't be sure, but my guess is that you're spending too much time on per-instance overhead. Instancing pays off when each instance is a model with hundreds of triangles in it, not two.

Try creating a vertex buffer with 1000 quads in it and see if the performance increases.
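Something like this (a minimal Swift sketch of the batching idea; `Float2` and `batchQuads` are illustrative names, not from the question's code): expand each quad into two triangles on the CPU, then draw the whole array with a single non-instanced call.

```swift
struct Float2 { var x: Float; var y: Float }

// Expand each quad (center position + half-size) into two triangles,
// 6 vertices per quad, so a whole batch can be drawn with a single
// non-instanced drawPrimitives(type: .triangle, ...) call.
func batchQuads(centers: [Float2], halfSize: Float) -> [Float2] {
    var verts: [Float2] = []
    verts.reserveCapacity(centers.count * 6)
    for c in centers {
        let l = c.x - halfSize, r = c.x + halfSize
        let b = c.y - halfSize, t = c.y + halfSize
        // triangle 1: bottom-left, bottom-right, top-left
        verts.append(Float2(x: l, y: b))
        verts.append(Float2(x: r, y: b))
        verts.append(Float2(x: l, y: t))
        // triangle 2: bottom-right, top-right, top-left
        verts.append(Float2(x: r, y: b))
        verts.append(Float2(x: r, y: t))
        verts.append(Float2(x: l, y: t))
    }
    return verts
}

// Batch 1000 quads into one vertex array.
let centers = (0..<1000).map { i in Float2(x: Float(i), y: 0) }
let batch = batchQuads(centers: centers, halfSize: 0.5)
print(batch.count)
```

The trade-off is that positions are now baked into the vertex buffer, so moving the quads means rewriting (or regenerating) the buffer each frame instead of updating a small per-instance array.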