Chunk Rendering in Metal

Question

I'm trying to create a procedural game using Metal, and I'm using an octree-based chunk approach for a Level of Detail implementation.

With the method I'm using, the CPU creates the octree nodes for the terrain, and each node's mesh is then generated on the GPU by a compute shader. This mesh is stored in a vertex buffer and an index buffer in the chunk object for rendering.

All of this seems to work fairly well, but I'm hitting performance issues early on when rendering chunks. Currently I gather an array of chunks to draw and submit it to my renderer, which creates an MTLParallelRenderCommandEncoder and then uses it to create one MTLRenderCommandEncoder per chunk; the result is then submitted to the GPU.

By the looks of it, around 50% of the CPU time is spent creating the MTLRenderCommandEncoder for each chunk. Currently I'm just creating a simple 8-vertex cube mesh for each chunk, and with a 4x4x4 array of chunks I'm already dropping to around 50 fps in these early stages. (In reality it seems there can only be up to 63 MTLRenderCommandEncoders in each MTLParallelRenderCommandEncoder, so it's not the full 4x4x4.)

I've read that the point of the MTLParallelRenderCommandEncoder is to create each MTLRenderCommandEncoder in a separate thread, yet I've not had much luck with getting this to work. Also multithreading it wouldn't get around the cap of 63 chunks being rendered as a max.

I feel that somehow consolidating the vertex and index buffers for each chunk into one or two larger buffers for submission would help, but I'm not sure how to do this without copious memcpy() calls and whether or not this would even improve efficiency.

Here's my code that takes in the array of nodes and draws them:

func drawNodes(nodes: [OctreeNode], inView view: AHMetalView){
  // For control of several rotating buffers
  dispatch_semaphore_wait(displaySemaphore, DISPATCH_TIME_FOREVER)

  makeDepthTexture()

  updateUniformsForView(view, duration: view.frameDuration)
  let commandBuffer = commandQueue.commandBuffer()


  let optDrawable = layer.nextDrawable()

  guard let drawable = optDrawable else{
    return
  }

  let passDescriptor = MTLRenderPassDescriptor()

  passDescriptor.colorAttachments[0].texture = drawable.texture
  passDescriptor.colorAttachments[0].clearColor = MTLClearColorMake(0.2, 0.2, 0.2, 1)
  passDescriptor.colorAttachments[0].storeAction = .Store
  passDescriptor.colorAttachments[0].loadAction = .Clear

  passDescriptor.depthAttachment.texture = depthTexture
  passDescriptor.depthAttachment.clearDepth = 1
  passDescriptor.depthAttachment.loadAction = .Clear
  passDescriptor.depthAttachment.storeAction = .Store

  let parallelRenderPass = commandBuffer.parallelRenderCommandEncoderWithDescriptor(passDescriptor)

  // Currently 63 nodes as a maximum
  for node in nodes{
    // This line is taking up around 50% of the CPU time
    let renderPass = parallelRenderPass.renderCommandEncoder()

    renderPass.setRenderPipelineState(renderPipelineState)
    renderPass.setDepthStencilState(depthStencilState)
    renderPass.setFrontFacingWinding(.CounterClockwise)
    renderPass.setCullMode(.Back)

    let uniformBufferOffset = sizeof(AHUniforms) * uniformBufferIndex

    renderPass.setVertexBuffer(node.vertexBuffer, offset: 0, atIndex: 0)
    renderPass.setVertexBuffer(uniformBuffer, offset: uniformBufferOffset, atIndex: 1)

    renderPass.setTriangleFillMode(.Lines)

    renderPass.drawIndexedPrimitives(.Triangle, indexCount: AHMaxIndicesPerChunk, indexType: AHIndexType, indexBuffer: node.indexBuffer, indexBufferOffset: 0)

    renderPass.endEncoding()
  }
  parallelRenderPass.endEncoding()

  commandBuffer.presentDrawable(drawable)

  commandBuffer.addCompletedHandler { (commandBuffer) -> Void in
    self.uniformBufferIndex = (self.uniformBufferIndex + 1) % AHInFlightBufferCount
    dispatch_semaphore_signal(self.displaySemaphore)
  }

  commandBuffer.commit()
}

rickster · Accepted Answer · 2015-12-02T20:47:10

You note:

I've read that the point of the MTLParallelRenderCommandEncoder is to create each MTLRenderCommandEncoder in a separate thread...

And you're correct. What you're doing is sequentially creating, encoding with, and ending command encoders — there's nothing parallel going on here, so MTLParallelRenderCommandEncoder is doing nothing for you. You'd have roughly the same performance if you eliminated the parallel encoder and just created encoders with renderCommandEncoderWithDescriptor(_:) on each pass through your for loop... which is to say, you'd still have the same performance problem due to the overhead of creating all those encoders.

So, if you're going to encode sequentially, just reuse the same encoder. Also, you should reuse as much of your other shared state as possible. Here's a quick pass at a possible refactoring (untested):

let passDescriptor = MTLRenderPassDescriptor()

// call this once before your render loop
func setup() {
    makeDepthTexture()

    passDescriptor.colorAttachments[0].clearColor = MTLClearColorMake(0.2, 0.2, 0.2, 1)
    passDescriptor.colorAttachments[0].storeAction = .Store
    passDescriptor.colorAttachments[0].loadAction = .Clear

    passDescriptor.depthAttachment.texture = depthTexture
    passDescriptor.depthAttachment.clearDepth = 1
    passDescriptor.depthAttachment.loadAction = .Clear
    passDescriptor.depthAttachment.storeAction = .Store

    // set up render pipeline state and depthStencil state
}

func drawNodes(nodes: [OctreeNode], inView view: AHMetalView) {

    updateUniformsForView(view, duration: view.frameDuration)

    // Set up completed handler ahead of time
    let commandBuffer = commandQueue.commandBuffer()
    commandBuffer.addCompletedHandler { _ in // unused parameter
        self.uniformBufferIndex = (self.uniformBufferIndex + 1) % AHInFlightBufferCount
        dispatch_semaphore_signal(self.displaySemaphore)
    }

    // Semaphore should be tied to drawable acquisition
    dispatch_semaphore_wait(displaySemaphore, DISPATCH_TIME_FOREVER)
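    guard let drawable = layer.nextDrawable() else {
        // This command buffer is never committed, so its completed handler
        // will not fire; release the semaphore here to avoid a deadlock
        dispatch_semaphore_signal(displaySemaphore)
        return
    }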

    // Set up the one part of the pass descriptor that changes per-frame
    passDescriptor.colorAttachments[0].texture = drawable.texture

    // Create one render command encoder and reuse it for every chunk
    let renderPass = commandBuffer.renderCommandEncoderWithDescriptor(passDescriptor)
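    renderPass.setTriangleFillMode(.Lines)
    renderPass.setRenderPipelineState(renderPipelineState)
    renderPass.setDepthStencilState(depthStencilState)
    // Same winding and culling as the original code; set once, not per chunk
    renderPass.setFrontFacingWinding(.CounterClockwise)
    renderPass.setCullMode(.Back)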

    // The uniform buffer binding is the same for every chunk, so set it once
    let uniformBufferOffset = sizeof(AHUniforms) * uniformBufferIndex
    renderPass.setVertexBuffer(uniformBuffer, offset: uniformBufferOffset, atIndex: 1)

    for node in nodes {
        // Only the per-chunk vertex and index buffers change between draws
        renderPass.setVertexBuffer(node.vertexBuffer, offset: 0, atIndex: 0)
        renderPass.drawIndexedPrimitives(.Triangle, indexCount: AHMaxIndicesPerChunk, indexType: AHIndexType, indexBuffer: node.indexBuffer, indexBufferOffset: 0)
    }
    renderPass.endEncoding()

    commandBuffer.presentDrawable(drawable)
    commandBuffer.commit()
}
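
As an aside on the uniformBufferOffset arithmetic above, the rotating-buffer bookkeeping can be sketched in isolation. This is a minimal sketch in current Swift syntax (MemoryLayout instead of sizeof), not the asker's actual code: UniformRing, FrameUniforms, and alignUp are hypothetical names, and any alignment value you pass is an assumption, since the required buffer offset alignment depends on the platform and GPU.

```swift
// Hypothetical stand-in for AHUniforms, whose real layout is not shown in the question.
struct FrameUniforms {
    var column0 = SIMD4<Float>(repeating: 0)
    var column1 = SIMD4<Float>(repeating: 0)
    var column2 = SIMD4<Float>(repeating: 0)
    var column3 = SIMD4<Float>(repeating: 0)
}

/// Rounds `value` up to the next multiple of `alignment`.
func alignUp(_ value: Int, to alignment: Int) -> Int {
    precondition(alignment > 0, "alignment must be positive")
    return (value + alignment - 1) / alignment * alignment
}

/// Tracks which slot of a shared uniform buffer the current frame writes to.
struct UniformRing {
    let slotCount: Int   // number of frames in flight (AHInFlightBufferCount in the question)
    let slotStride: Int  // bytes reserved per frame, already aligned
    private(set) var index = 0

    init(slotCount: Int, slotStride: Int) {
        precondition(slotCount > 0 && slotStride > 0)
        self.slotCount = slotCount
        self.slotStride = slotStride
    }

    /// Total size the backing buffer needs.
    var totalLength: Int { return slotCount * slotStride }

    /// Byte offset to bind for the current frame.
    var currentOffset: Int { return index * slotStride }

    /// Moves to the next slot, wrapping around; call once per completed frame.
    mutating func advance() { index = (index + 1) % slotCount }
}
```

With three frames in flight and an assumed 256-byte alignment, a ring built as UniformRing(slotCount: 3, slotStride: alignUp(MemoryLayout<FrameUniforms>.stride, to: 256)) would hand out the offsets 0, 256, 512 and then wrap back to 0. The semaphore in the code above is what keeps the CPU from overwriting a slot the GPU is still reading.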

Then, profile with Instruments to see what, if any, further performance issues you might have. There's a great WWDC 2015 session about that showing several of the common "gotchas", how to diagnose them in profiling, and how to fix them.
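
If profiling shows that CPU-side encoding is still the bottleneck after that, the parallel encoder can be used the way it was intended: a few sub-encoders, each filled on its own thread with a batch of chunks, rather than one sub-encoder per chunk. The following is an untested sketch in current Swift syntax. ChunkMesh, splitRange, encodeChunksInParallel, the workerCount default of 4, and the .uint32 index type are all assumptions made for illustration; they are not from the question's code.

```swift
import Dispatch
import Metal

// Hypothetical stand-in for the question's OctreeNode.
struct ChunkMesh {
    let vertexBuffer: MTLBuffer
    let indexBuffer: MTLBuffer
    let indexCount: Int
}

/// Splits 0..<count into at most `parts` contiguous, near-equal ranges.
func splitRange(count: Int, parts: Int) -> [Range<Int>] {
    guard count > 0, parts > 0 else { return [] }
    let n = min(count, parts)
    let base = count / n
    let extra = count % n
    var ranges: [Range<Int>] = []
    var start = 0
    for i in 0..<n {
        let length = base + (i < extra ? 1 : 0)
        ranges.append(start..<(start + length))
        start += length
    }
    return ranges
}

func encodeChunksInParallel(chunks: [ChunkMesh],
                            commandBuffer: MTLCommandBuffer,
                            passDescriptor: MTLRenderPassDescriptor,
                            pipelineState: MTLRenderPipelineState,
                            depthStencilState: MTLDepthStencilState,
                            uniformBuffer: MTLBuffer,
                            uniformOffset: Int,
                            workerCount: Int = 4) {
    guard let parallel = commandBuffer.makeParallelRenderCommandEncoder(descriptor: passDescriptor) else { return }

    // Create the sub-encoders up front on this thread: the GPU executes them
    // in creation order, regardless of which thread finishes encoding first.
    var jobs: [(Range<Int>, MTLRenderCommandEncoder)] = []
    for range in splitRange(count: chunks.count, parts: workerCount) {
        guard let encoder = parallel.makeRenderCommandEncoder() else { break }
        jobs.append((range, encoder))  // ranges without an encoder are skipped
    }

    // Each worker owns exactly one sub-encoder and one batch of chunks.
    DispatchQueue.concurrentPerform(iterations: jobs.count) { i in
        let (range, encoder) = jobs[i]
        encoder.setRenderPipelineState(pipelineState)
        encoder.setDepthStencilState(depthStencilState)
        encoder.setVertexBuffer(uniformBuffer, offset: uniformOffset, index: 1)
        for chunk in chunks[range] {
            encoder.setVertexBuffer(chunk.vertexBuffer, offset: 0, index: 0)
            encoder.drawIndexedPrimitives(type: .triangle,
                                          indexCount: chunk.indexCount,
                                          indexType: .uint32,  // assumed index type
                                          indexBuffer: chunk.indexBuffer,
                                          indexBufferOffset: 0)
        }
        encoder.endEncoding()
    }

    // All sub-encoders have ended by now, so the parallel encoder can end too.
    parallel.endEncoding()
}
```

Because the number of sub-encoders is tied to the worker count instead of the chunk count, this stays far below the limit of 63 sub-encoders observed in the question. Whether it beats the single-encoder version depends on how much work each batch contains, so it is worth measuring both.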
