2 votes

I'm experiencing a very weird situation. I have these template structs:

#ifdef __CUDACC__
#define __HOSTDEVICE __host__ __device__
#else
#define __HOSTDEVICE
#endif

template <typename T>
struct matrix
{
    T* ptr;
    int col_size, row_size;
    int stride;
    // some host & device methods
};

struct dummy1 {};
struct dummy2 : dummy1 {};

template <typename T>
struct a_functor : dummy2
{
    matrix<T> help_m;
    matrix<T> x, y;
    T *x_ptr, *y_ptr;
    int bsx, ind_thr;
    __HOSTDEVICE void operator()(T* __x, T* __y)
    {
        // functor code
    }
};

I've structured my code into separate .cpp and .cu files, so the a_functor object is created in a .cpp file and used in a kernel function. The problem is that, when executing operator() inside a kernel, I found some random behaviour I couldn't explain just by looking at the code. It was as if my structs were somehow corrupted. So, calling sizeof() on an a_functor object, I found:

  • CPU code (.cpp and .cu outside kernel): 64 bytes

  • GPU code (inside kernel): 68 bytes

There was obviously some kind of mismatch that ruined everything. Going further, I tracked the distance between each member's address and the start of the struct itself - to inspect the produced memory layout - and here's what I found:

a_functor foo;
// CPU
(char*)(&foo.help_m)    - (char*)(&foo)   = 0
(char*)(&foo.x)         - (char*)(&foo)   = 16
(char*)(&foo.y)         - (char*)(&foo)   = 32
(char*)(&foo.x_ptr)     - (char*)(&foo)   = 48
(char*)(&foo.y_ptr)     - (char*)(&foo)   = 52
(char*)(&foo.bsx)       - (char*)(&foo)   = 56
(char*)(&foo.ind_thr)   - (char*)(&foo)   = 60

// GPU - inside a_functor::operator(), in-kernel
(char*)(&this->help_m)  - (char*)(this)   = 4
(char*)(&this->x)       - (char*)(this)   = 20
(char*)(&this->y)       - (char*)(this)   = 36
(char*)(&this->x_ptr)   - (char*)(this)   = 52
(char*)(&this->y_ptr)   - (char*)(this)   = 56
(char*)(&this->bsx)     - (char*)(this)   = 60
(char*)(&this->ind_thr) - (char*)(this)   = 64

I really can't understand why nvcc generated this memory layout for my struct (what are those 4 bytes supposed to be/do?!). I thought it could be an alignment problem, and I tried to explicitly align a_functor, but I can't because it is passed by value to the kernel

template <typename T, typename Str>
__global__ void mykernel(Str foo, T* src, T* dst);

and when I try to compile I get

error: cannot pass a parameter with a too large explicit alignment to a global routine on win32 platforms

So, to resolve this strange situation (...and I do think it's an nvcc bug), what should I do? The only thing I can think of is playing with alignment and passing my struct to the kernel by pointer, to avoid the aforementioned error. However, I'm really wondering: why the memory layout mismatch?! It really makes no sense...
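The pass-by-pointer workaround I have in mind would look something like this (a sketch only; the kernel body, launch configuration, and error checking are omitted, and d_src/d_dst are assumed to be device buffers allocated elsewhere):

```cuda
template <typename T, typename Str>
__global__ void mykernel(const Str* foo, T* src, T* dst)
{
    // Use foo-> directly, or make a local copy. The object was built
    // on the host and copied as raw bytes, so this only behaves if
    // host and device agree on the layout of Str.
    Str f = *foo;
    // ... kernel body using f(...) ...
}

// Host side: copy the functor to device memory and pass its address.
a_functor<float> h_foo;              // set up on the host
a_functor<float>* d_foo = 0;
cudaMalloc(&d_foo, sizeof(h_foo));
cudaMemcpy(d_foo, &h_foo, sizeof(h_foo), cudaMemcpyHostToDevice);
mykernel<float, a_functor<float> ><<<grid, block>>>(d_foo, d_src, d_dst);
```

Note that if the layouts genuinely differ between host and device, as the measurements above suggest, this only avoids the alignment error: the raw bytes copied by cudaMemcpy are still reinterpreted with the device layout.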

Further information: I'm using Visual Studio 2008, compiling with MSVC on Windows XP 32bit platform. I installed the latest CUDA Toolkit 5.0.35. My card is a GeForce GTX 570 (compute capability 2.0).

I built a simple app around the code you have posted. I tested this app on a linux 64 bit platform and a win7 32 bit platform, and was not able to reproduce a difference in the size of the object between the host code and kernel code in either case. In the 64 bit case, the object size was 96 bytes in both host and kernel case, and on the 32 bit win7 platform it was 64 bytes in both host and kernel case. Can you post a simple, self-contained, complete reproducer, along the lines of my test code here? – Robert Crovella
Dear Robert, I posted a closer-to-reality version here: pastebin.com/qdPNa2Zs. Executing that code (mine & yours), I couldn't see the strange behaviour I have in my real code. But, even more strangely, I tried to remove from the real code the equivalent of that : dummy2 (i.e. derivation from an empty struct). I needed it, but I can do without it painlessly. You know what? This apparently pointless move solved my problem! I mean... maybe those 4 bytes were "room" for an empty base struct! The whole thing is still inexplicable to me. – biagiop1986
PS: My apologies if I forgot to add that derivation in the simplified code I posted above. Now that I know it was the source of all my pain, I'm really sorry I didn't include it in the reduced version, but I have corrected it accordingly. – biagiop1986
I'm still unable to reproduce the problem with the modified code you have posted now, on either 32 bit or 64 bit. The sizes are unchanged. If you're unable to demonstrate the issue, I doubt I can be of any assistance. – Robert Crovella

1 Answer

3 votes

From the comments it appears there may be differences between the code you're actually running and the code you've posted, so it's difficult to give more than vague answers without someone being able to reproduce the problem. That said, on Windows there are cases where the layout and size of a struct can differ between the CPU and the GPU; these are documented in the programming guide:

On Windows, the CUDA compiler may produce a different memory layout, compared to the host Microsoft compiler, for a C++ object of class type T that satisfies any of the following conditions:

  • T has virtual functions or derives from a direct or indirect base class that has virtual functions;
  • T has a direct or indirect virtual base class;
  • T has multiple inheritance with more than one direct or indirect empty base class.

The size for such an object may also be different in host and device code. As long as type T is used exclusively in host or device code, the program should work correctly. Do not pass objects of type T between host and device code (e.g., as arguments to global functions or through cudaMemcpy*() calls).

The third case may apply to your code, since you have an empty base class; do you have multiple inheritance in the real code?