0
votes

Let's say I have a data structure:

struct MyBigData {
    float * dataArray;
    float * targetArray;
    float * nodes;
    float * dataDataData;
};

I would like to be able to pass this structure around some various CUDA kernels. I don't want to have to pass multiple arrays as arguments, so can I just pass the structure and be done with it? I know the kernels support C structures, but how about dynamic memory in the C structures?

It seems I would just do this to make the structure on the CUDA card:

MyBigData * mbd = (MyBigData *) cudaMalloc( sizeof(MyBigData) );

But how about the dynamic memory for the arrays in the structure? This line below compiles but has a run-time error:

mbd->dataArray = (float *) cudaMalloc( 10 * sizeof(float) );

This is because cudaMalloc() runs on the CPU, and it cannot read mbd->dataArray to set the pointer equal to the new memory address. So there's a run-time error. However, this compiles and runs, but doesn't seem to be what I want:

MyBigData * mbd = (MyBigData *) malloc( sizeof(MyBigData) );
mbd->dataArray = (float *) cudaMalloc( 10 * sizeof(float) );

Because now, although this is valid, now mbd resides on the main system memory, and the float pointer points to memory allocated on the CUDA device. So I can't just pass a pointer to the MyBigData structure, I have to pass each variable in the structure to the kernel individually. Not clean. What I want is:

someKernel<<<1,1>>>(mbd);

Not:

someKernel<<<1,1>>>(mbd->dataArray, mbd->targetArray, mbd->nodes, mbd->dataDataData);

So I was thinking, how about cudaMemcpy()? I was thinking of this:

MyBigData *d_mbd = cudaMemcpy( (void*) &d_mbd, (void*) mbd, SOMESIZE, cudaMemcpyHostToDevice);

But then what do I put for SOMESIZE? I can't use sizeof(MyBigData), because that will include the size of float pointers, not the actual size of the arrays. Second, is cudaMemcpy() even smart enough to dig down into sub-objects of a complicated data structure? I think not.

So, is it impossible to have a structure containing dynamic memory on the CUDA card? Or am I missing something? The easy way would be to have a CUDA kernel allocate some memory, but you can't call cudaMalloc() from a CUDA kernel.

Thoughts?

UPDATE 7 May: I wrote this code, and it compiles, but it tells me all the values are zero. I think I am creating the object correctly and populating the values properly with the CUDA kernel. The values are just the thread ID. I suspect I'm not printing the values properly. Thoughts? And thank you!

MyBigData* generateData(const int size) {
    MyBigData *mbd_host, *mbd_cuda;
    mbd_host = (MyBigData *) malloc( sizeof(MyBigData) );
    cudaMalloc( (void**) &mbd_host->dataArray, size * sizeof(float) );
    cudaMalloc( (void**) &mbd_host->targetArray, size * sizeof(float) );
    cudaMalloc( (void**) &mbd_host->nodes, size * sizeof(float) );
    cudaMalloc( (void**) &mbd_host->dataDataData, size * sizeof(float) );
    cudaMalloc( (void**) &mbd_cuda, sizeof(MyBigData) );
    cudaMemcpy( mbd_cuda, mbd_host, sizeof(mbd_host), cudaMemcpyHostToDevice );
    free(mbd_host);
    return mbd_cuda;
}

void printCudaData(MyBigData* mbd_cuda, const int size) {
    MyBigData *mbd;
    cudaMemcpy( mbd, mbd_cuda, sizeof(mbd_cuda), cudaMemcpyDeviceToHost);
    MyBigData *mbd_host = (MyBigData *) malloc( sizeof(MyBigData));
    mbd_host->dataArray = (float*) malloc(size * sizeof(float));
    mbd_host->targetArray = (float*) malloc(size * sizeof(float));
    mbd_host->nodes = (float*) malloc(size * sizeof(float));
    mbd_host->dataDataData = (float*) malloc(size * sizeof(float));

    cudaMemcpy( mbd_host->dataArray, mbd->dataArray, size * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy( mbd_host->targetArray, mbd->targetArray, size * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy( mbd_host->nodes, mbd->nodes, size * sizeof(float), cudaMemcpyDeviceToHost);
    cudaMemcpy( mbd_host->dataDataData, mbd->dataDataData, size * sizeof(float), cudaMemcpyDeviceToHost);

    for(int i = 0; i < size; i++) {
        printf("data[%i] = %f\n", i, mbd_host->dataArray[i]);
        printf("target[%i] = %f\n", i, mbd_host->targetArray[i]);
        printf("nodes[%i] = %f\n", i, mbd_host->nodes[i]);
        printf("data2[%i] = %f\n", i, mbd_host->dataDataData[i]);
    }

    free(mbd_host->dataArray);
    free(mbd_host->targetArray);
    free(mbd_host->nodes);
    free(mbd_host->dataDataData);
    free(mbd_host);
}

This is my Kernel and the function that calls it:

__global__ void cudaInitData(MyBigData* mbd) {
    const int threadID = threadIdx.x;
    mbd->dataArray[threadID] = threadID;
    mbd->targetArray[threadID] = threadID;
    mbd->nodes[threadID] = threadID;
    mbd->dataDataData[threadID] = threadID;
}

void initData(MyBigData* mbd, const int size) {
    if (mbd == NULL)
        mbd = generateData(size);

    cudaInitData<<<size,1>>>(mbd);
}

My main() calls:

MyBigData* mbd = NULL;
initData(mbd, 10);
printCudaData(mbd, 10);
1
I am not a CUDA developer, but it sounds like what you're describing would very much not be possible the way you've described it: when you're sharing pointers between two discrete memory blocks, things are just not going to work. The memcpy family of functions wants a contiguous block of data, which you don't have. What I am curious about is the constant 10: if your arrays are always length 10, why not build your data structure to be 4 * (sizeof(float*) + 10 * sizeof(float))? - David Souther

1 Answer

2
votes

Second, is cudaMemcpy() even smart enough to dig down into sub-objects of a complicated data structure? I think not.

You're right, cudaMemcpy() does not make a recursive copy. To achieve what you want, you should do something like this:

// Create mbd on host
MyBigData *mbd_host, *mbd;
mbd_host = (MyBigData *) malloc( sizeof(MyBigData) );
// Fill it with pointers to device arrays
cudaMalloc( (void**) &mbd_host->dataArray, 10 * sizeof(float) );
// etc for other structure fields
// Create mbd on device
cudaMalloc( (void**) &mbd, sizeof(MyBigData) );
// Copy structure, filled with device addresses, to device memory
cudaMemcpy( mbd, mbd_host, sizeof(MyBigData), cudaMemcpyHostToDevice );
// Voila!

By the way, it's probably a good idea to store your MyBigData structure not in global memory, but in __constant__ memory on the device: declare a __constant__ variable instead of allocating mbd with cudaMalloc, and use cudaMemcpyToSymbol() instead of the last cudaMemcpy().
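A minimal sketch of that __constant__ variant (field names taken from the question; error checking omitted, and setup() is just an illustrative name):

```cuda
#include <cuda_runtime.h>

struct MyBigData {
    float *dataArray;
    float *targetArray;
    float *nodes;
    float *dataDataData;
};

// The table of device pointers lives in constant memory,
// so kernels no longer need the struct passed as an argument.
__constant__ MyBigData c_mbd;

__global__ void cudaInitData() {
    const int t = threadIdx.x;
    c_mbd.dataArray[t]    = t;  // pointers read from constant memory,
    c_mbd.targetArray[t]  = t;  // data written to global memory
    c_mbd.nodes[t]        = t;
    c_mbd.dataDataData[t] = t;
}

void setup(int size) {
    MyBigData h;  // host-side staging copy holding device addresses
    cudaMalloc((void**) &h.dataArray,    size * sizeof(float));
    cudaMalloc((void**) &h.targetArray,  size * sizeof(float));
    cudaMalloc((void**) &h.nodes,        size * sizeof(float));
    cudaMalloc((void**) &h.dataDataData, size * sizeof(float));
    // Copy the pointer table into the __constant__ symbol.
    cudaMemcpyToSymbol(c_mbd, &h, sizeof(MyBigData));
    cudaInitData<<<1, size>>>();
}
```

Note the arrays themselves still live in global memory; only the small struct of pointers sits in constant memory, where every thread reads the same addresses and the broadcast is cheap.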