I'm experiencing a very weird situation. I have this template structures:
#ifdef __CUDACC__
#define __HOSTDEVICE __host__ __device__
#else
#define __HOSTDEVICE
#endif
template <typename T>
struct matrix
{
T* ptr;
int col_size, row_size;
int stride;
// some host & device methods
};
struct dummy1 {};
struct dummy2 : dummy1 {};
template <typename T>
struct a_functor : dummy2
{
matriz<T> help_m;
matrix<T> x, y;
T *x_ptr, *y_ptr;
int bsx, ind_thr;
__HOSTDEVICE void operator()(T* __x, T* __y)
{
// functor code
}
};
I've structured my code to separate cpp and cu files, so a_functor object is created in cpp file and used in a kernel function. The problem is that, executing operator() inside a kernel, I found some random behaviour I couldn't explain only looking at code. It was like my structs were sort of corrupted. So, calling a sizeof() on an a_functor object, I found:
CPU code (.cpp and .cu outside kernel): 64 bytes
GPU code (inside kernel): 68 bytes
There was obviously some kind of mismatching that ruined the whole stuff. Going further, I tracked the distance between struct parameter pointers and struct itself - to try to inspect the produced memory layout - and here's what I found:
a_functor foo;
// CPU
(char*)(&foo.help_m) - (char*)(&foo) = 0
(char*)(&foo.x) - (char*)(&foo) = 16
(char*)(&foo.y) - (char*)(&foo) = 32
(char*)(&foo.x_ptr) - (char*)(&foo) = 48
(char*)(&foo.y_ptr) - (char*)(&foo) = 52
(char*)(&foo.bsx) - (char*)(&foo) = 56
(char*)(&foo.ind_thr) - (char*)(&foo) = 60
// GPU - inside a_functor::operator(), in-kernel
(char*)(&this->help_m) - (char*)(this) = 4
(char*)(&this->x) - (char*)(this) = 20
(char*)(&this->y) - (char*)(this) = 36
(char*)(&this->x_ptr) - (char*)(this) = 52
(char*)(&this->y_ptr) - (char*)(this) = 56
(char*)(&this->bsx) - (char*)(this) = 60
(char*)(&this->ind_thr) - (char*)(this) = 64
I really can't understand why nvcc generated this memory layout for my struct (what are that 4 bytes supposed to be/do!?!). I thought it could be an alignment problem and I tryed to explicitly align a_functor, but I can't because it is passed by value in kernel
template <typename T, typename Str>
__global__ void mykernel(Str foo, T* src, T*dst);
and when I try compile I get
error: cannot pass a parameter with a too large explicit alignment to a global routine on win32 platforms
So, to solve this strange situation (...and I do think that's an nvcc bug), what should I do? The only thing I can think of is playing with alignment and passing my struct to kernel by pointer to avoid the aforementioned error. However, I'm really wondering: why that memory layout mismatching?! It really makes no sense...
Further information: I'm using Visual Studio 2008, compiling with MSVC on Windows XP 32bit platform. I installed the latest CUDA Toolkit 5.0.35. My card is a GeForce GTX 570 (compute capability 2.0).