I am going to implement a way of creating 3D models on the GPU using CUDA. I did this several years ago, but CUDA has presumably developed since then, so I'd like some input on how best to do what I have in mind.
In my C++ version I have a vector of Voxels, where Voxel is a struct containing a few floats. The vector represents an entire grid, and I will do computations on each voxel independently.
Earlier I had to use raw pointers, cudaMalloc, and so forth to access the voxels on the device. I am wondering whether there are newer features I could use instead.
Is there something like std::vector that can be used inside the actual kernel? Thrust is not suitable, since it is meant to be called from the host.
More interestingly, is it possible to do dynamic memory allocation on the device, so that I could implement something like octrees on the GPU?
That would allow for larger scale reconstructions.
Any ideas are appreciated!
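For context, dynamic allocation inside kernels does exist: devices of compute capability 2.0 and higher support malloc/free in device code, drawing from a heap whose size can be raised with cudaDeviceSetLimit before launch. A minimal sketch (the kernel name and sizes are illustrative, not from any particular codebase):

```cuda
#include <cstdio>

// Each thread allocates its own scratch buffer from the device heap.
__global__ void scratchKernel(int n) {
    float *buf = (float *)malloc(n * sizeof(float));
    if (buf == NULL) return;           // the device heap may be exhausted
    for (int i = 0; i < n; ++i)
        buf[i] = (float)i;
    free(buf);                         // freed on the device as well
}

int main() {
    // The device heap is small by default (8 MB); enlarge it up front.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024);
    scratchKernel<<<1, 32>>>(16);
    cudaDeviceSynchronize();
    return 0;
}
```

One caveat worth noting: pointers returned by device-side malloc live in the device heap and can only be dereferenced from device code, which matters if an octree built this way ever needs to be copied back to the host.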
Edit:
It seems one has to stick with classic C-style coding using pointers and cudaMalloc, but dynamic memory allocation on the device is possible.
Say I have this struct:
struct Data {
    float *p;
};
and I start with an array
Data data[10];
Later, if I want to allocate an array of 30 floats in data[2], in plain C I would do something like
data[2].p = (float*)malloc(30*sizeof(float));
What would the corresponding code look like in CUDA?
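To make the question concrete, here is a sketch of one way it might look, using device-side malloc from inside a kernel (assumes compute capability 2.0+; the kernel name and launch configuration are illustrative):

```cuda
struct Data {
    float *p;
};

// A single thread allocates the per-element buffer on the device heap.
__global__ void allocKernel(Data *data) {
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        data[2].p = (float *)malloc(30 * sizeof(float));
        if (data[2].p != NULL)
            for (int i = 0; i < 30; ++i)
                data[2].p[i] = 0.0f;
    }
}

int main() {
    Data *d_data;
    cudaMalloc(&d_data, 10 * sizeof(Data));     // the array of structs itself
    cudaMemset(d_data, 0, 10 * sizeof(Data));
    allocKernel<<<1, 1>>>(d_data);
    cudaDeviceSynchronize();
    // Note: d_data[2].p points into the device heap and is only usable
    // from device code; it cannot be passed to cudaMemcpy on the host.
    cudaFree(d_data);
    return 0;
}
```

Whether this is the right pattern for an octree (versus pre-allocating a node pool with cudaMalloc and handing out indices) is exactly the kind of trade-off I'd like input on.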