0
votes

I'm relatively new to CUDA programming, so I want to clarify the behaviour of a struct when I pass it into a kernel. I've defined the following struct to somewhat imitate the behavior of a 3D array that knows its own size:

struct protoarray {
    size_t dim1;
    size_t dim2;
    size_t dim3;
    float* data;
};

I create two variables of type protoarray, dynamically allocate space for data via malloc and cudaMalloc on the host and device side respectively, and update dim1, dim2 and dim3 to reflect the size of the array I want this struct to represent. I read in this thread that the struct should be passed by copy. So this is what I do in my kernel:
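Roughly, my setup looks like this (a sketch; the actual sizes are placeholders, and error checking is omitted):

```cuda
// Host-side setup (sketch; dimensions are example values, error checking omitted)
protoarray arr_host, arr_device;
arr_host.dim1 = arr_device.dim1 = 2;
arr_host.dim2 = arr_device.dim2 = 3;
arr_host.dim3 = arr_device.dim3 = 4;

size_t count = arr_host.dim1 * arr_host.dim2 * arr_host.dim3;
arr_host.data = (float*)malloc(count * sizeof(float));
cudaMalloc(&arr_device.data, count * sizeof(float));

// The struct itself is passed by value; only its data member points to device memory
kernel<<<blocks, threads>>>(state, arr_device);
```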

__global__ void kernel(curandState_t *state, protoarray arr_device){
    const size_t dim1 = arr_device.dim1;
    const size_t dim2 = arr_device.dim2;
    
    for(size_t j(0); j < dim2; j++){
        for(size_t i(0); i < dim1; i++){
            // Do something 
        }
    }
}

The struct is passed by copy, so all its contents are copied into shared memory of each block. This is where I'm getting bizarre behaviour, which I'm hoping you could help me with. Suppose I had set arr_device.dim1 = 2 on the host side. While debugging inside the kernel with a breakpoint at one of the for loops, checking the value of arr_device.dim1 yields something like 16776576 — not large enough to cause an overflow, and yet this value copies correctly into dim1 as 2, which means the for loops execute as I intended. As a side question: is using size_t, which is essentially unsigned long long int, bad practice, seeing as GPUs are made of 32-bit cores?

Generally, how safe is it to pass structs and classes into kernels as arguments? Is it bad practice that should be avoided at all costs? I imagine that passing pointers to classes to kernels is difficult if they contain members which point to dynamically allocated memory, and that they should be very lightweight if I want to pass them by value.

1
Are you sure size_t is the same size on both the CPU and GPU? (And structs without constructors are generally safe, I believe.) - Ken Y-N
@KenY-N: CUDA guarantees that sizeof(size_t) and sizeof(void *) are the same on the host and device for whatever platform is being used. It might not be the same from platform to platform - talonmies
This worries me -- "The struct is passed by copy, so all its contents are copied into shared memory of each block". That isn't at all how things work. And whatever the problem is, it isn't being caused by anything you have shown in this question. minimal reproducible example, please - talonmies
@KenY-N Changed all to unsigned int, and now the values match. Could you please refer me to some source which discusses the issue of structs with constructors in kernels? @talonmies "The struct is passed by copy, so all its contents are copied into shared memory of each block" - this is essentially a citation from the NVIDIA developer forum thread I linked. - konovification
@SlavaK.: OK, that was true in 2009. It hasn't been true since 2010. All kernel arguments go in a dedicated constant memory bank on all CUDA hardware except the very first generation G80/G90 parts - talonmies

1 Answer

2
votes

This is a partial answer, since without a proper program to look at, it is difficult/impossible to guess why you would see an invalid value in arr_device.dim1.

The struct is passed by copy, so all its contents are copied into shared memory of each block.

Incorrect. Kernel arguments are stored in constant memory, which is device-global, not block-specific. They are not stored in shared memory (which is block-specific).

When a thread runs, it typically reads arguments from constant memory into registers (and again, not shared memory).

Generally, how safe is it to pass struct and class into kernels as arguments

My personal rule of thumb on this matter is: If the struct/class...

  • is trivially-copyable; and
  • all of its members are defined both for the host and the device side, or at least designed with GPU use in mind;

then it should be safe to pass to a kernel.

passing struct and class into kernels as arguments [ - ] is [it] bad practice that should be avoided at all cost?

No. But remember that most C++ libraries only provide host-side code, and were not written with use on a GPU in mind. So I'd be wary of using non-trivial classes without a lot of scrutiny.

I imagine that passing pointers to classes to kernels is difficult in case they contain members which point to dynamically allocated memory

Yes, this can be problematic. However - if you used cuda::memory::managed::allocate(), cuda::memory::managed::make_unique() or cudaMallocManaged() - then this should "just work", i.e. the relevant memory pages will be fetched to the GPU or the CPU as necessary when accessed.
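A minimal sketch of the cudaMallocManaged() route (dimensions are example values; error checking omitted):

```cuda
// Sketch: a struct whose pointer member lives in managed memory,
// so both host and device code can dereference it (error checking omitted)
protoarray arr;
arr.dim1 = 2; arr.dim2 = 3; arr.dim3 = 4;
cudaMallocManaged(&arr.data, arr.dim1 * arr.dim2 * arr.dim3 * sizeof(float));

arr.data[0] = 1.0f;            // valid on the host...
kernel<<<1, 1>>>(state, arr);  // ...and the same pointer is valid on the device
cudaDeviceSynchronize();
cudaFree(arr.data);
```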

and that they should be very lightweight if I want to pass [objects to kernels] by value.

Yes, because each and every thread has to read each argument from constant memory before it can use that argument. And while constant memory allows this to happen relatively quickly, it's still a bunch of overhead that you want to minimize.

Also remember that you can't pass anything to kernels by (C++) reference; it's all "by-value" - the object itself or a pointer to it.
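To illustrate the two options (a sketch; kernel bodies omitted):

```cuda
// Option 1: pass the struct by value - it is copied into constant memory
__global__ void kernel_by_value(protoarray arr)        { /* use arr.dim1 etc. */ }

// Option 2: pass a pointer - the protoarray object itself must then live in
// device-accessible memory (e.g. allocated with cudaMalloc or cudaMallocManaged)
__global__ void kernel_by_pointer(const protoarray* arr) { /* use arr->dim1 etc. */ }
```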