CUDA and OpenGL - efficiency problem

Question

I wonder why the way I initialize a VBO makes such a big difference in FPS when interoperating with CUDA. When I create the VBO, there are two possibilities:

1. The VBO only reserves memory of the given size. In this case, the particle positions are written to the VBO for the first time inside the kernel and are modified by the kernel afterwards:
```
gl.glBufferData(GL3.GL_ARRAY_BUFFER, n_particles * 4 * Sizeof.FLOAT, null, GL3.GL_DYNAMIC_DRAW);
```
2. The VBO reserves memory of the given size and is filled with initial data (the particle positions; of course, these values are later modified by the kernel):
```
gl.glBufferData(GL3.GL_ARRAY_BUFFER, n_particles * 4 * Sizeof.FLOAT, FloatBuffer.wrap(particlesPositions), GL3.GL_DYNAMIC_DRAW);
```

Frame rates: case 1 gives ~408 FPS, case 2 gives ~75 FPS.

You can reproduce this behaviour with the Simple OpenGL example from the NVIDIA GPU Computing SDK.

Nicol Bolas · Accepted Answer · 2011-08-18T22:10:57

Because the first case doesn't have to upload data to the GPU. The second case does.

It's the difference between:

```c
void *memory = malloc(size);
```

and

```c
void *memory = malloc(size);
memcpy(memory, data, size);
```

The first is necessarily faster than the second.

Also, you may wish to use GL_STREAM_DRAW instead of GL_DYNAMIC_DRAW if you're frequently calling glBufferData on the same buffer object.