Below is a simple vertex and fragment shader pair in Metal that renders 64 identical 2D quads.

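// Layout of `state`: state[0] is a global x scale, followed by 4 floats per
// instance (x offset, y offset, x scale, y scale).
// `du` and `dv` are not declared in this snippet; they are defined elsewhere.
vertex VertexOut vertexMain(uint k [[ vertex_id ]],
                            uint ii [[instance_id]],
                            device float2* tex [[buffer(2)]],
                            device float2* position [[buffer(1)]],
                            device float* state [[buffer(0)]]){
    VertexOut output;
    int i = 4*ii+1;                         // first float of this instance's state
    float2 pos = position[k];
    pos *= float2(state[i+2],state[i+3]);   // per-instance scale
    pos += float2(state[i],state[i+1]);     // per-instance offset
    pos.x *= state[0];                      // global x scale
    output.position = float4(pos,0,1);
    output.tex = tex[k]*float2(du,dv);      // shrink the UVs to the sampled region
    return output;
}
fragment float4 fragmentMain(VertexOut input [[stage_in]],
                             texture2d<float> texture [[texture(0)]],
                             sampler sam [[sampler(0)]] ){
    return texture.sample(sam, input.tex);
}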

The sampler uses normalized coordinates, so du and dv can range from 0 to 1. They control how large a region of the texture is sampled, starting at the lower left corner.

It seems I have a misunderstanding about how sampling works in Metal. I would expect the computational cost to remain constant no matter what values du and dv hold. However, as I increase du and dv towards 1, the frame rate drops. I am not using any mipmapping, nor am I changing the size of the quads that are rasterized on screen. The effect is more dramatic with linear filtering but happens with nearest filtering as well. It seems to me that since the number of pixels drawn to the screen is the same, the load on the GPU should not depend on du and dv. What am I missing?

EDIT: Here is my sampler and color attachment:

    let samplerDescriptor = MTLSamplerDescriptor()
    samplerDescriptor.normalizedCoordinates = true
    samplerDescriptor.minFilter = .linear
    samplerDescriptor.magFilter = .linear
    let sampler = device.makeSamplerState(descriptor: samplerDescriptor)

    let attachment = pipelineStateDescriptor.colorAttachments[0]
    attachment?.isBlendingEnabled = true
    attachment?.sourceRGBBlendFactor = .one
    attachment?.destinationRGBBlendFactor = .oneMinusSourceAlpha
Can you quantify how much of a frame rate drop you experience? - warrenm
From 60 to 40 with linear sampling. From 60 to 50 with nearest sampling. - gloo
On which device and OS version? - warrenm
iPad mini and iPad Pro 9.7, both running 10.2. - gloo

1 Answer

As you increase du and dv your quads are displaying more of your texture. GPUs tend to have small-ish caches for texture data, and as you display more of your texture, you'll be discarding and refilling that cache more.

Thrashing the texture cache uses more memory bandwidth, which is quite a limited resource. Texture memory bandwidth is often not the bottleneck, but since your fragment shader does almost nothing other than texture fetches, it isn't a surprise that it is the bottleneck here. It follows that altering your UVs has an effect on performance.
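To put rough numbers on this, here is a back-of-the-envelope sketch in Swift. It is not a model of any real GPU: it assumes nearest filtering, a texture stored in square blocks of `blockSize` texels, and that every block touched is fetched exactly once. The function name, the 4x4 block size, the 256-pixel quad and the 2048-texel texture are all made up for illustration.

```swift
/// Estimates the bytes fetched for one square quad under nearest filtering.
/// Assumptions (hypothetical, not taken from any hardware documentation):
/// - the texture is stored in blockSize x blockSize texel blocks,
/// - a block is fetched once if any pixel samples a texel inside it.
func estimatedBytesFetched(quadSide: Int, textureSide: Int, uvExtent: Double,
                           blockSize: Int = 4, bytesPerTexel: Int = 4) -> Int {
    // Side length, in texels, of the region selected by du/dv.
    let regionSide = Int((Double(textureSide) * uvExtent).rounded(.up))
    // Number of blocks along one axis of that region (rounded up).
    let blocksInRegion = (regionSide + blockSize - 1) / blockSize
    // A row of quadSide pixels cannot touch more than quadSide blocks.
    let blocksPerAxis = min(blocksInRegion, quadSide)
    return blocksPerAxis * blocksPerAxis * blockSize * blockSize * bytesPerTexel
}

// Same 256 x 256 pixel quad, same 2048 x 2048 RGBA8 texture, different du/dv.
let small = estimatedBytesFetched(quadSide: 256, textureSide: 2048, uvExtent: 0.125)
let full = estimatedBytesFetched(quadSide: 256, textureSide: 2048, uvExtent: 1.0)
print(small, full, full / small) // 262144 4194304 16
```

With du = dv = 0.125, neighbouring pixels read neighbouring texels, so each fetched block is fully used. With du = dv = 1, neighbouring pixels land 8 texels apart, so every pixel pulls in its own block and most of the fetched data is thrown away. Linear filtering reads a 2x2 footprint per sample and can straddle block edges, which would only make the second case worse.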

What is a surprise is that the framerate drops below 60 on these devices when all you're doing is rendering 64 quads (the iPad Pro in particular is a very powerful device). That said, if all 64 quads each cover most of the screen, the framerate drop could be understandable.

To improve performance, you'd need to decrease the amount of texture data that needs to be shovelled around by the GPU. Changing from a 32-bit texture format (8888) to 16-bit (565/4444) or 4-bit (PVRTC compressed textures) would have a big impact.
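As a quick illustration of how much data each option moves, here is a small Swift sketch that computes the size of one mip level in each of those formats. The `TexelFormat` enum and the 1024 x 1024 size are hypothetical; in Metal itself I believe the matching `MTLPixelFormat` cases are `.rgba8Unorm`, `.b5g6r5Unorm`, `.abgr4Unorm` and `.pvrtc_rgba_4bpp`, but check the documentation for your target devices.

```swift
/// Illustrative format families (hypothetical type, not a Metal API).
enum TexelFormat {
    case rgba8888, rgb565, rgba4444, pvrtc4bpp

    /// Storage cost per texel, in bits.
    var bitsPerTexel: Int {
        switch self {
        case .rgba8888: return 32
        case .rgb565, .rgba4444: return 16
        case .pvrtc4bpp: return 4
        }
    }
}

/// Bytes needed for a single width x height level in the given format.
func textureBytes(width: Int, height: Int, format: TexelFormat) -> Int {
    return width * height * format.bitsPerTexel / 8
}

let side = 1024 // hypothetical texture size
for format in [TexelFormat.rgba8888, .rgb565, .rgba4444, .pvrtc4bpp] {
    print(format, textureBytes(width: side, height: side, format: format))
}
// rgba8888 4194304
// rgb565 2097152
// rgba4444 2097152
// pvrtc4bpp 524288
```

Every texel fetched is proportionally cheaper, so a bandwidth-bound shader like this one should benefit roughly in line with the size reduction, at the cost of colour precision (16-bit) or compression artifacts (PVRTC).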

The really big win is probably to enable mipmapping. Assuming that with high values of du and dv you end up minifying the texture (several texels map to each screen pixel), using mipmapping will give a huge performance benefit, and as an added bonus your textures will look nicer too (it'll fix aliasing). Not a bad return for a 33% texture memory increase.
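A minimal sketch of what enabling mipmapping could look like, assuming a recent SDK where the `make...` calls return optionals. The helper function names are mine, and `.rgba8Unorm` is only a placeholder for whatever format your texture uses. The Metal part is wrapped in `canImport` so the arithmetic at the end still runs anywhere.

```swift
#if canImport(Metal)
import Metal

/// Descriptor for a texture that allocates a full mip chain.
func makeMipmappedTextureDescriptor(width: Int, height: Int) -> MTLTextureDescriptor {
    return MTLTextureDescriptor.texture2DDescriptor(pixelFormat: .rgba8Unorm,
                                                    width: width,
                                                    height: height,
                                                    mipmapped: true)
}

/// Same sampler as in the question, plus a mip filter so the smaller
/// levels are actually used when the texture is minified.
func makeMipmappedSampler(device: MTLDevice) -> MTLSamplerState? {
    let descriptor = MTLSamplerDescriptor()
    descriptor.normalizedCoordinates = true
    descriptor.minFilter = .linear
    descriptor.magFilter = .linear
    descriptor.mipFilter = .linear
    return device.makeSamplerState(descriptor: descriptor)
}

/// Fills the lower mip levels from level 0 on the GPU.
/// Call once after uploading the base level.
func generateMipmaps(for texture: MTLTexture, using queue: MTLCommandQueue) {
    guard let commandBuffer = queue.makeCommandBuffer(),
          let blit = commandBuffer.makeBlitCommandEncoder() else { return }
    blit.generateMipmaps(for: texture)
    blit.endEncoding()
    commandBuffer.commit()
}
#endif

/// Total texel count of a full mip chain, halving each side down to 1 x 1.
func mipChainTexels(width: Int, height: Int) -> Int {
    var w = width, h = height, total = 0
    while true {
        total += w * h
        if w == 1 && h == 1 { break }
        w = max(1, w / 2)
        h = max(1, h / 2)
    }
    return total
}

let base = 1024 * 1024
let chain = mipChainTexels(width: 1024, height: 1024)
print(Double(chain) / Double(base)) // about 1.333, i.e. the 33% overhead
```

The last lines check the 33% figure: the levels below the base add 1/4 + 1/16 + ... of its size, which sums to just under one third.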