2
votes

My Vulkan program is running extremely slowly, and I'm trying to figure out why. I've noticed that even a few draw calls drain performance far more than they should. For instance, here's an extract (pseudocode) for rendering a few meshes:

int32_t numCalls = 0;
int32_t numIndices = 0;
for(auto &mesh : meshes)
{
    auto vertexBuffer = mesh.GetVertexBuffer();
    auto indexBuffer = mesh.GetIndexBuffer();

    vk::DeviceSize offset = 0;
    drawCmd.bindVertexBuffers(0, 1, &vertexBuffer, &offset); // drawCmd = command buffer for all drawing commands (single thread)
    drawCmd.bindIndexBuffer(indexBuffer, offset, vk::IndexType::eUint16);

    drawCmd.drawIndexed(mesh.GetIndexCount(), 1, 0, 0, 0);

    numIndices += mesh.GetIndexCount();
    ++numCalls;
}

There are 238 meshes being rendered, with a total index count of 52050. The GPU is definitely not overburdened (the shaders are extremely cheap).

If I run my program with the code above, the frame is rendered in approximately 46 ms. Without it, it takes a mere 9 ms.

I'm using the FIFO present mode with 2 swapchain images. I have only a single primary command buffer at this time (no secondary/pre-recorded command buffers), reused for all frames.

My problem is that I don't really know what to look for. These few rendering calls should barely make a dent, so the source of the problem must lie elsewhere.

Can anyone give me any hints on how I should tackle this? Are there any profilers around for Vulkan yet? I just need a nudge in the right direction.

// EDIT:

So, it looks like vkDeviceWaitIdle takes about 32 ms to execute if all 238 meshes are rendered (if none are rendered, it's < 1 ms). Most of the stalling stems from there, but I still don't know what to do about it.

2
auto vertexBuffer = mesh.GetVertexBuffer(); this results in a copy which might be a bottleneck. - Sebastian Hoffmann
It's not; it's a handle (8 bytes). Same goes for the index buffer. The actual C++ code has basically no impact on the frame time at all. - Silverlan
vkDeviceWaitIdle What are you calling that for? That's like glFinish; it's something you should never do. Not unless you're doing a major application state transition (and probably not even then) or an application tear-down. - Nicol Bolas
Sounds like vkDeviceWaitIdle is your main problem, but also, do you have validation layers enabled? Using VK_LAYER_LUNARG_standard_validation adds a significant overhead. - Columbo

2 Answers

4
votes

So, it looks like vkDeviceWaitIdle takes about 32 ms to execute if all 238 meshes are rendered (if none are rendered, it's < 1 ms). Most of the stalling stems from there, but I still don't know what to do about it.

Avoid using vkDeviceWaitIdle. It's the heaviest synchronization operation available and will force the GPU to finish and flush all work.

Try using the other, more lightweight synchronization objects (semaphores, barriers, fences, and events) and specify the access masks and pipeline stages scoped as narrowly as possible.

A narrow scope, especially for the pipeline stages, ensures that other parts of the pipeline can continue working, whereas vkDeviceWaitIdle may stall all parts of the pipeline.
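In the same vulkan.hpp pseudocode style as the question, a per-frame fence can replace the device-wide wait. This is only a sketch; the names (frameFence, graphicsQueue) and the semaphore setup inside submitInfo are assumed, not taken from your code:

// Created once, pre-signalled so the very first wait returns immediately:
vk::Fence frameFence = device.createFence({vk::FenceCreateFlagBits::eSignaled});

// Each frame:
device.waitForFences(1, &frameFence, VK_TRUE, UINT64_MAX); // waits only for this frame's previous submit
device.resetFences(1, &frameFence);
// ... safe to re-record drawCmd now, then:
graphicsQueue.submit(1, &submitInfo, frameFence); // fence is signalled when the GPU finishes this submit

This waits only for your own previous frame's work to finish instead of draining the entire device every frame.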

3
votes

There is absolutely no reason to use vkDeviceWaitIdle in your render loop.

Instead, you should pass a VkFence to the vkQueueSubmit call and use vkGetFenceStatus to see whether you can touch the memory used by the command buffer.

This would be used like a ring buffer so multiple copies of the mutable data (view matrix and such) are stored until the GPU is done with them.