I am trying to implement multi-threaded command buffer generation (using a per-thread command pool and secondary command buffers), but there is little performance gain from using multiple threads.

First, I thought that my thread pool code was incorrectly written, but I tried Sascha Willems's thread pool implementation and nothing changed, so I don't think that's the issue.

Second, I searched for multi-threading performance issues and found that accessing the same variables/resources from different threads causes a performance drop, but I still can't figure out the problem in my case.

I also downloaded Sascha Willems's multi-threading example, ran it, and it worked just fine. When I modified the number of worker threads, the performance gain from using multiple threads was clearly visible.
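
One common form of the slowdown described above is false sharing, where threads write to different variables that happen to sit on the same cache line. Below is a minimal sketch of the usual mitigation, giving each thread its own padded slot. It assumes a 64-byte cache line (typical on x86, but an assumption here), and `PaddedCounter` and `countInParallel` are hypothetical names that do not come from the project:

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumed cache-line size; 64 bytes is typical on x86 but not guaranteed.
constexpr std::size_t kCacheLineSize = 64;

// Each counter occupies its own cache line, so threads writing to
// neighbouring counters do not invalidate each other's cache lines.
struct alignas(kCacheLineSize) PaddedCounter {
    std::uint64_t value = 0;
};

// Every thread increments only its own slot; the slots are summed after join.
std::uint64_t countInParallel(unsigned numThreads, std::uint64_t iterations) {
    std::vector<PaddedCounter> counters(numThreads);
    std::vector<std::thread> workers;
    workers.reserve(numThreads);
    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&counters, t, iterations] {
            for (std::uint64_t i = 0; i < iterations; ++i) {
                ++counters[t].value;
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }

    std::uint64_t total = 0;
    for (const auto& counter : counters) {
        total += counter.value;
    }
    return total;
}
```

Whether this kind of sharing is actually happening in a given program can only be confirmed by profiling; the sketch only shows the layout technique.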

Here are some FPS results for rendering 600 objects (all the same model). You can see what my problem is:

Core count    Sascha Willems's result    My result, validation layers    My result, validation layers
              (avg. FPS)                 enabled (avg. FPS)              disabled (avg. FPS)

1             45                         30                              55
2             83                         33                              72
4             110                        40                              84
6             155                        42                              103
8             162                        42                              104
10            173                        40                              111
12            175                        40                              119

This is where I prepare the thread data:

void prepareThreadData()
{
    // Pool and primary command buffer used by the main thread.
    primaryCommandPool = m_device.createCommandPool (
        vk::CommandPoolCreateInfo (
            vk::CommandPoolCreateFlags(vk::CommandPoolCreateFlagBits::eResetCommandBuffer),
            graphicsQueueIdx
        )
    );

    primaryCommandBuffer = m_device.allocateCommandBuffers (
        vk::CommandBufferAllocateInfo (
            primaryCommandPool,
            vk::CommandBufferLevel::ePrimary,
            1
        )
    )[0];

    threadData.resize(numberOfThreads);

    for (int i = 0; i < numberOfThreads; ++i)
    {
        // One command pool per thread, so recording needs no locking.
        threadData[i].commandPool = m_device.createCommandPool (
            vk::CommandPoolCreateInfo (
                vk::CommandPoolCreateFlags(vk::CommandPoolCreateFlagBits::eResetCommandBuffer),
                graphicsQueueIdx
            )
        );

        // One secondary command buffer per object handled by this thread.
        threadData[i].commandBuffer = m_device.allocateCommandBuffers (
            vk::CommandBufferAllocateInfo (
                threadData[i].commandPool,
                vk::CommandBufferLevel::eSecondary,
                numberOfObjectsPerThread
            )
        );

        // Per-object push constants (a random position for each object).
        for (int j = 0; j < numberOfObjectsPerThread; ++j)
        {
            VertexPushConstant pushConstant = { someRandomPosition() };
            threadData[i].pushConstBlock.push_back(pushConstant);
        }
    }
}
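
The snippet above does not show how `numberOfObjectsPerThread` is derived. As a minimal sketch of one possible way to split the objects when the object count does not divide evenly by the thread count, the hypothetical helper below (`objectRangeForThread` is not part of the project) hands each thread a contiguous index range. It assumes `numThreads` is greater than zero:

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>

// Hypothetical helper: returns the half-open range [first, last) of object
// indices that thread `threadIndex` should record. Any remainder is spread
// over the first threads, so range sizes differ by at most one.
// Assumes numThreads > 0 and threadIndex < numThreads.
std::pair<std::size_t, std::size_t> objectRangeForThread(
    std::size_t totalObjects, std::size_t numThreads, std::size_t threadIndex) {
    const std::size_t base = totalObjects / numThreads;
    const std::size_t remainder = totalObjects % numThreads;
    const std::size_t first = threadIndex * base + std::min(threadIndex, remainder);
    const std::size_t count = base + (threadIndex < remainder ? 1 : 0);
    return {first, first + count};
}
```

For the 600 objects and the thread counts used in the table, the split is already even, so this only matters for other combinations.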

Here is my render loop, where I hand out the jobs to each thread:

while (!display.IsWindowClosed())
{
    display.PollEvents();

    m_device.acquireNextImageKHR(m_swapChain, std::numeric_limits<uint64_t>::max(), presentCompleteSemaphore, nullptr, &currentBuffer);

    // The primary command buffer only begins the render pass and executes the secondaries.
    primaryCommandBuffer.begin(vk::CommandBufferBeginInfo());
    primaryCommandBuffer.beginRenderPass(
        vk::RenderPassBeginInfo(m_renderPass, m_swapChainBuffers[currentBuffer].frameBuffer, m_renderArea, clearValues.size(), clearValues.data()),
        vk::SubpassContents::eSecondaryCommandBuffers);

    // Secondary command buffers inherit the render pass and framebuffer.
    vk::CommandBufferInheritanceInfo inheritanceInfo = {};
    inheritanceInfo.renderPass = m_renderPass;
    inheritanceInfo.framebuffer = m_swapChainBuffers[currentBuffer].frameBuffer;

    // One job per object: each job records one secondary command buffer.
    for (int t = 0; t < numberOfThreads; ++t)
    {
        for (int i = 0; i < numberOfObjectsPerThread; ++i)
        {
            threadPool.threads[t]->addJob([=]
            {
                std::array<vk::DeviceSize, 1> offsets = { 0 };
                vk::Viewport viewport = vk::Viewport(0.0f, 0.0f, WIDTH, HEIGHT, 0.0f, 1.0f);
                vk::Rect2D renderArea = vk::Rect2D(vk::Offset2D(), vk::Extent2D(WIDTH, HEIGHT));

                threadData[t].commandBuffer[i].begin(vk::CommandBufferBeginInfo(vk::CommandBufferUsageFlagBits::eRenderPassContinue, &inheritanceInfo));
                threadData[t].commandBuffer[i].setViewport(0, viewport);
                threadData[t].commandBuffer[i].setScissor(0, renderArea);
                threadData[t].commandBuffer[i].bindPipeline(vk::PipelineBindPoint::eGraphics, m_graphicsPipeline);
                threadData[t].commandBuffer[i].bindVertexBuffers(VERTEX_BUFFER_BIND, 1, &model.vertexBuffer, offsets.data());
                threadData[t].commandBuffer[i].bindIndexBuffer(model.indexBuffer, 0, vk::IndexType::eUint32);
                threadData[t].commandBuffer[i].pushConstants(pipelineLayout, vk::ShaderStageFlagBits::eVertex, 0, sizeof(VertexPushConstant), &threadData[t].pushConstBlock[i]);
                threadData[t].commandBuffer[i].drawIndexed(model.indexCount, 1, 0, 0, 0);
                threadData[t].commandBuffer[i].end();
            });
        }
    }

    // Block until every thread has finished recording.
    threadPool.wait();

    // Collect all secondary command buffers and execute them from the primary one.
    std::vector<vk::CommandBuffer> commandBuffers;
    for (int t = 0; t < numberOfThreads; ++t)
    {
        for (int i = 0; i < numberOfObjectsPerThread; ++i)
        {
            commandBuffers.push_back(threadData[t].commandBuffer[i]);
        }
    }

    primaryCommandBuffer.executeCommands(commandBuffers.size(), commandBuffers.data());
    primaryCommandBuffer.endRenderPass();
    primaryCommandBuffer.end();

    submitQueue(presentCompleteSemaphore, primaryCommandBuffer);
}

If you have any idea what I'm missing or doing wrong, please let me know.

Here is the full VS 2017 project if anyone wants to play with it :D

I know it's a MESS, but I'm just learning Vulkan.

You probably have some piece of code that the original doesn't have and which causes one thread to obstruct the other. Typical candidates are shared resources that are too sparse, thread convoys, and cache ping-pong between cores. From the info you provide, it's impossible to tell. One thing I would do is compare the two code bases with a single thread in the thread pool. How does that perform in comparison? - Ulrich Eckhardt
Yeah, you're right, these code snippets don't tell much, so I've provided the full project. With only one thread, my code is much slower than the original, as you can see in the table above. - Zoltán
There's Amdahl's law. How big a speedup do you actually expect? You actually have a relatively large speedup up to four threads in your table. If you have a quad-core CPU, that makes sense. How is it "slower"? 40 FPS is faster than 30 FPS. - krOoze
I have a 12 core CPU and I don't really understand why I got stuck at 40 with that. - Zoltán
Thanks for your comments, it seems I've found the problem. - Zoltán

1 Answer

It seems that I've found the problem: I had left the validation layers enabled. After disabling them, the performance increased a lot; I've added a fourth column to the table in the question for comparison. Who knew that validation layers eat up so much run time? If anyone wants to measure Vulkan's performance, don't forget to disable them!