I need to render a lot of small objects (2-100 triangles each) that live in a deep hierarchy, and each object has its own matrix. To render them, I precalculate the actual (world) matrix for each object, put the objects into a single flat list, and issue two calls per object: set the matrix uniform and gl.drawElements().
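Simplified, my current render loop looks something like this (just a sketch; uniform/attribute locations are looked up once, the program is already bound, and attribute arrays are already enabled):

// One uniform upload and one tiny draw call per object.
function drawScene(gl, loc, objects, projMatrix) {
    gl.uniformMatrix4fv(loc.uPMatrix, false, projMatrix);
    for (const obj of objects) {
        // obj.mvMatrix is the model-view matrix precalculated on the CPU from the hierarchy
        gl.uniformMatrix4fv(loc.uMVMatrix, false, obj.mvMatrix);
        gl.bindBuffer(gl.ARRAY_BUFFER, obj.vertexBuffer);
        gl.vertexAttribPointer(loc.aVertexPosition, 3, gl.FLOAT, false, 0, 0);
        gl.bindBuffer(gl.ELEMENT_ARRAY_BUFFER, obj.indexBuffer);
        // each object is only 2-100 triangles, so this becomes thousands of tiny draw calls per frame
        gl.drawElements(gl.TRIANGLES, obj.indexCount, gl.UNSIGNED_SHORT, 0);
    }
}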
Obviously this is not the fastest way to go. Once I have a few thousand objects, performance becomes unacceptable. The only solution I can think of is to batch multiple objects into a single buffer. But that isn't easy, because each object has its own matrix, and to put an object into a shared buffer I have to transform its vertices by that matrix on the CPU. An even worse problem is that the user can move any object at any time, which forces me to re-transform a large amount of vertex data again (moving an object also moves all of its nested children).
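The batching variant I have in mind would look roughly like this (sketch only, the helper names are made up):

// All objects share one big vertex buffer, pre-transformed on the CPU.
// When an object (or any of its ancestors) moves, its vertices must be
// re-transformed and re-uploaded.
function updateBatchedObject(gl, batch, obj) {
    const offset = batch.vertexOffsetOf(obj);      // float offset of this object's vertices
    const local = obj.localPositions;              // Float32Array, 3 floats per vertex
    for (let i = 0; i < local.length; i += 3) {
        // hypothetical helper: writes obj.worldMatrix * localPosition into the shared array
        transformPoint(batch.positions, offset + i, obj.worldMatrix, local, i);
    }
    gl.bindBuffer(gl.ARRAY_BUFFER, batch.vertexBuffer);
    gl.bufferSubData(gl.ARRAY_BUFFER, offset * 4,  // byte offset, 4 bytes per float
                     batch.positions.subarray(offset, offset + local.length));
    // Moving a parent means repeating this for every descendant, which is exactly what worries me.
}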
So I'm looking for alternative approaches, and I recently found this unusual vertex shader in the onshape.com project:
uniform mat4 uMVMatrix;
uniform mat3 uNMatrix;
uniform mat4 uPMatrix;
uniform vec3 uSpecular;
uniform float uOpacity;
uniform float uColorAmbientFactor; //Determines how much of the vertex-specified color to use in the ambient term
uniform float uColorDiffuseFactor; //Determines how much of the vertex-specified color to use in the diffuse term
uniform bool uApplyTranslucentAlphaToAll;
uniform float uTranslucentPassAlpha;
attribute vec3 aVertexPosition;
attribute vec3 aVertexNormal;
attribute vec2 aTextureCoordinate;
attribute vec4 aVertexColor;
varying vec3 vPosition;
varying lowp vec3 vNormal;
varying mediump vec2 vTextureCoordinate;
varying lowp vec3 vAmbient;
varying lowp vec3 vDiffuse;
varying lowp vec3 vSpecular;
varying lowp float vOpacity;
attribute vec4 aOccurrenceId;
// The occurrence id is packed into the g, b and a channels of aOccurrenceId; r carries a hashed body id.
float unpackOccurrenceId() {
    return aOccurrenceId.g * 65536.0 + aOccurrenceId.b * 256.0 + aOccurrenceId.a;
}
float unpackHashedBodyId() {
    return aOccurrenceId.r;
}
#define USE_OCCURRENCE_TEXTURE 1
#ifdef USE_OCCURRENCE_TEXTURE
uniform sampler2D uOccurrenceDataTexture;
uniform float uOccurrenceTexelWidth;
uniform float uOccurrenceTexelHeight;
#define ELEMENTS_PER_OCCURRENCE 2.0
void getOccurrenceData(out vec4 occurrenceData[2]) {
    // We will extract the occurrence data from the occurrence texture by converting the occurrence id to texture coordinates
    // Convert the packed occurrenceId into a single number
    float occurrenceId = unpackOccurrenceId();
    // We first determine the row of the texture by dividing by the overall texture width. Each occurrence
    // has multiple rgba texture entries, so we need to account for each of those entries when determining the
    // element's offset into the buffer.
    float divided = (ELEMENTS_PER_OCCURRENCE * occurrenceId) * uOccurrenceTexelWidth;
    float row = floor(divided);
    vec2 coordinate;
    // The actual coordinate lies between 0 and 1. We need to take care that coordinate lies on the texel
    // center by offsetting the coordinate by a half texel.
    coordinate.t = (0.5 + row) * uOccurrenceTexelHeight;
    // Figure out the width of one texel in texture space
    // Since we've already done the texture width division, we can figure out the horizontal coordinate
    // by adding a half-texel width to the remainder
    coordinate.s = (divided - row) + 0.5 * uOccurrenceTexelWidth;
    occurrenceData[0] = texture2D(uOccurrenceDataTexture, coordinate);
    // The second piece of texture data will lie in the adjacent column
    coordinate.s += uOccurrenceTexelWidth;
    occurrenceData[1] = texture2D(uOccurrenceDataTexture, coordinate);
}
#else
// Fallback path: the occurrence data is supplied directly as two per-vertex attributes
// instead of being fetched from a texture.
attribute vec4 aOccurrenceData0;
attribute vec4 aOccurrenceData1;
void getOccurrenceData(out vec4 occurrenceData[2]) {
    occurrenceData[0] = aOccurrenceData0;
    occurrenceData[1] = aOccurrenceData1;
}
#endif
/**
* Create a model matrix from the given occurrence data.
*
* The method for deriving the rotation matrix from the euler angles is based on this publication:
* http://www.soi.city.ac.uk/~sbbh653/publications/euler.pdf
*/
mat4 createModelTransformationFromOccurrenceData(vec4 occurrenceData[2]) {
    float cx = cos(occurrenceData[0].x);
    float sx = sin(occurrenceData[0].x);
    float cy = cos(occurrenceData[0].y);
    float sy = sin(occurrenceData[0].y);
    float cz = cos(occurrenceData[0].z);
    float sz = sin(occurrenceData[0].z);
    mat4 modelMatrix = mat4(1.0);
    float scale = occurrenceData[0][3];
    modelMatrix[0][0] = (cy * cz) * scale;
    modelMatrix[0][1] = (cy * sz) * scale;
    modelMatrix[0][2] = -sy * scale;
    modelMatrix[1][0] = (sx * sy * cz - cx * sz) * scale;
    modelMatrix[1][1] = (sx * sy * sz + cx * cz) * scale;
    modelMatrix[1][2] = (sx * cy) * scale;
    modelMatrix[2][0] = (cx * sy * cz + sx * sz) * scale;
    modelMatrix[2][1] = (cx * sy * sz - sx * cz) * scale;
    modelMatrix[2][2] = (cx * cy) * scale;
    modelMatrix[3].xyz = occurrenceData[1].xyz;
    return modelMatrix;
}
void main(void) {
    vec4 occurrenceData[2];
    getOccurrenceData(occurrenceData);
    mat4 modelMatrix = createModelTransformationFromOccurrenceData(occurrenceData);
    mat3 normalMatrix = mat3(modelMatrix);
    vec4 position = uMVMatrix * modelMatrix * vec4(aVertexPosition, 1.0);
    vPosition = position.xyz;
    vNormal = uNMatrix * normalMatrix * aVertexNormal;
    vTextureCoordinate = aTextureCoordinate;
    vAmbient = uColorAmbientFactor * aVertexColor.rgb;
    vDiffuse = uColorDiffuseFactor * aVertexColor.rgb;
    vSpecular = uSpecular;
    vOpacity = uApplyTranslucentAlphaToAll ? (min(uTranslucentPassAlpha, aVertexColor.a)) : aVertexColor.a;
    gl_Position = uPMatrix * position;
}
It looks like they encode each object's position and rotation angles as two entries in a 4-component float texture, add an attribute that stores each vertex's offset (occurrence id) into that texture, and then reconstruct the model matrix in the vertex shader.
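If I understand it right, the host side would then look roughly like this (my own sketch, not Onshape's code; it assumes an existing WebGL context gl, OES_texture_float plus vertex texture fetch support, and for simplicity a single-row texture, whereas the real shader clearly wraps ids across several rows):

// Two RGBA float texels per object: [rotX, rotY, rotZ, scale] and [tx, ty, tz, unused].
const MAX_OBJECTS = 2048;
const texWidth = 2 * MAX_OBJECTS;                    // ELEMENTS_PER_OCCURRENCE texels per object
const occurrenceData = new Float32Array(4 * texWidth);
const occurrenceTexture = gl.createTexture();

gl.bindTexture(gl.TEXTURE_2D, occurrenceTexture);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MIN_FILTER, gl.NEAREST);
gl.texParameteri(gl.TEXTURE_2D, gl.TEXTURE_MAG_FILTER, gl.NEAREST);
gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, texWidth, 1, 0, gl.RGBA, gl.FLOAT, occurrenceData);

// Every vertex in the merged buffer carries the aOccurrenceId of its object, so when the
// user moves an object only its two texels need to be re-uploaded; the big shared vertex
// buffer itself never changes.
function moveObject(gl, id, rx, ry, rz, scale, tx, ty, tz) {
    occurrenceData.set([rx, ry, rz, scale, tx, ty, tz, 0], 8 * id);
    gl.bindTexture(gl.TEXTURE_2D, occurrenceTexture);
    gl.texSubImage2D(gl.TEXTURE_2D, 0, 2 * id, 0, 2, 1, gl.RGBA, gl.FLOAT,
                     occurrenceData.subarray(8 * id, 8 * id + 8));
}

// Rendering would then be a single gl.drawElements() over the whole merged buffer.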
So the question is: is this shader actually an effective solution to my problem, or would I be better off with batching or something else?
PS: Maybe an even better approach is to store a quaternion instead of Euler angles and rotate the vertices by it directly in the shader?
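For example (again my own sketch, reusing the occurrenceData layout from above), a unit quaternion fits into one texel and the shader could rotate each vertex without any sin/cos:

// Texel 0 = unit quaternion (x, y, z, w), texel 1 = (tx, ty, tz, scale).
function writeOccurrenceQuat(id, q, tx, ty, tz, scale) {
    occurrenceData.set([q[0], q[1], q[2], q[3], tx, ty, tz, scale], 8 * id);
}
// In the vertex shader the rotation would then be applied directly, e.g.:
//   vec3 rotate(vec4 q, vec3 v) {
//       return v + 2.0 * cross(q.xyz, cross(q.xyz, v) + q.w * v);
//   }
//   vec3 worldPosition = rotate(quat, aVertexPosition * scale) + translation;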