0
votes

As part of my thesis work I am working in a CUDA project (modifying somebody elses code, adding functionality, etc). Being new to CUDA this is turning to be a real challenge for me. I am working with compute capability 1.3 cards, 4 x Tesla C1060. And sadly, I am hitting some limitations of the platform.

I need to pass a couple of new structures to device, which I believe are been copied correctly. But, when trying to pass the pointers to structure on device on my kernel call I reach the 256 bytes limit (as addressed in this question).

My code goes like this:

// main.cu
static void RunGPU(HostThreadState *hstate)
{
  SimState *HostMem = &(hstate->host_sim_state);
  SimState DeviceMem;

  TetrahedronStructGPU *h_root = &(hstate->root);
  TetrahedronStructGPU *d_root;
  TriangleFacesGPU *h_faces = &(hstate->faces);
  TriangleFacesGPU *d_faces;

  GPUThreadStates tstates;

  unsigned int n_threads = hstate->n_tblks * NUM_THREADS_PER_BLOCK;
  unsigned int n_tetras  = hstate->n_tetras; // 9600
  unsigned int n_faces   = hstate->n_faces;  // 38400

  InitGPUStates(HostMem, h_root, h_faces, &DeviceMem, &tstates, hstate->sim, 
                d_root, d_faces, n_threads, n_tetras, n_faces );
  cudaThreadSynchronize();

  ...

  kernel<<<dimGrid, dimBlock, k_smem_sz>>>(DeviceMem, tstates, /*OK, these 2*/
                                           d_root, d_faces);
                           // Limit of 256 bytes adding d_root and/or d_faces
  cudaThreadSynchronize();

  ...

}

The InitGPUStates function is in another source file:

// kernel.cu
int InitGPUStates(SimState* HostMem, TetrahedronStructGPU* h_root,
                  TriangleFacesGPU* h_faces,
                  SimState* DeviceMem, GPUThreadStates *tstates,
                  SimulationStruct* sim, 
                  TetrahedronStructGPU* d_root, TriangleFacesGPU* d_faces,
                  int n_threads, int n_tetras, int n_faces)
{
  unsigned int size;

  // Allocate and copy RootTetrahedron (d_root) on device
  size = n_tetras * sizeof(TetrahedronStructGPU); // Too big
  checkCudaErrors(cudaMalloc((void**)&d_root, size));
  checkCudaErrors(cudaMemcpy(d_root, h_root, size, cudaMemcpyHostToDevice));

  // Allocate and copy Faces (d_faces) on device
  size = n_faces * sizeof(TriangleFacesGPU); // Too big
  checkCudaErrors(cudaMalloc((void**)&d_faces, size));
  checkCudaErrors(cudaMemcpy(d_faces, h_faces, size, cudaMemcpyHostToDevice));     

  ...
}

I understand that I need to pass only pointers to the locations on device memory. How can I get the address in device? Is this passing of pointers correctly done?

The two new structures are:

// header.h
typedef struct {
  int idx;
  int vertices[4];
  float Nx, Ny, Nz, d;
} TriangleFacesGPU;

typedef struct {
  int idx, region;
  int vertices[4], faces[4], adjTetras[4];
  float n, mua, mus, g;
} TetrahedronStructGPU;

// other structures
typedef struct {
  BOOLEAN *is_active;
  BOOLEAN *dead;
  BOOLEAN *FstBackReflectionFlag;
  int *NextTetrahedron;
  UINT32 *NumForwardScatters;
  UINT32 *NumBackwardScatters;
  UINT32 *NumBackwardsSpecularReflections;
  UINT32 *NumBiases;
  UINT32 *p_layer;
  GFLOAT *p_x, *p_y, *p_z;
  GFLOAT *p_ux, *p_uy, *p_uz;
  GFLOAT *p_w;
  GFLOAT *Rspecular;
  GFLOAT *LocationFstBias;
  GFLOAT *OpticalPath;
  GFLOAT *MaxDepth;
  GFLOAT *MaxLikelihoodRatioIncrease;
  GFLOAT *LikelihoodRatioIncreaseFstBias;
  GFLOAT *LikelihoodRatio;
  GFLOAT *LikelihoodRatioAfterFstBias;
  GFLOAT *s, *sleft;
  TetrahedronStructGPU *tetrahedron;
  TriangleFacesGPU *faces;
} GPUThreadStates;

typedef struct {
  UINT32 *n_p_left;
  UINT64 *x;
  UINT32 *a;
  UINT64 *Rd_ra;
  UINT64 *A_rz;
  UINT64 *Tt_ra;
} SimState;

The definition of kernel is

__global__ void kernel(SimState d_state, GPUThreadStates tstates,
                       TetrahedronStructGPU *d_root,
                       TriangleFacesGPU *d_faces);

I will work on changing SimState d_state to pointer pass SimState *d_state. As well as GPUThreadStates tstates to GPUThreadStates *tstates.

2
what is the declaration of kernel? For example you appear to be passing tstates by value to kernel. If sizeof(GPUThreadStates) is large, you can free up some breathing room by passing that structure by pointer rather than by value. The problem is, d_root and d_faces are already pointers. So if you're out of parameter space just adding those two pointers, you're going to need to shrink the size of something else you are passing, like DeviceMem (sizeof(SimState)) and tstates (sizeof(GPUThreadStates)). This will also affect your kernel code referencing these entities. - Robert Crovella
@RobertCrovella You are right. I wasn't sure I was doing the pointer passing properly. The kernel definition __global__ void MCMLKernel(SimState d_state, GPUThreadStates tstates, TetrahedronStructGPU *d_root, TriangleFacesGPU *d_faces) and both d_state and tstates are being passed by value, aren't they? - mrei
Yes, they appear to be, although you haven't actually shown the definition of GPUThreadStates and SimState. If the sizes of those are large, preventing you from adding d_root (a pointer) and d_faces (a pointer), then you will have to focus on those. - Robert Crovella
@RobertCrovella Thanks again. I am working on those, GPUThreadStates and SimState are quite big too. I am adding those definitions above. - mrei
@RobertCrovella I posted the modifications I made as an answer for better formatting. I am having errors code=11(cudaErrorInvalidValue) "cudaMalloc((void**)&DeviceMem->n_photons_left, size)" . I would really appreciate your help! Thanks! - mrei

2 Answers

1
votes

It seems that you haven't initialized the DeviceMem structure, which is supposed to hold the pointer that should be later initialized with cudaMalloc.

You should do something like:

SimState* DeviceMem;

cudaMalloc(&DeviceMem, sizeof(SimState)) 

too (or any other way to allocate memory for that pointer).

0
votes

Finally, solved the 256 bytes issue. But, really still lost in pointers

My modified code goes like this:

// main.cu
static void RunGPU(HostThreadState *hstate)
{
  SimState *HostMem = &(hstate->host_sim_state);

  // new pointers to pass
  SimState *DeviceMem = (SimState*)malloc(sizeof(SimState));
  GPUThreadStates *tstates = (GPUThreadStates*)malloc(sizeof(GPUThreadStates));

  TetrahedronStructGPU *h_root = hstate->root; //root, pointer in HostThreadState
  TetrahedronStructGPU *d_root;
  TriangleFacesGPU *h_faces = hstate->faces; //faces, pointer in HostThreadState
  TriangleFacesGPU *d_faces;

  unsigned int n_threads = hstate->n_tblks * NUM_THREADS_PER_BLOCK;
  unsigned int n_tetras  = hstate->n_tetras; // 9600
  unsigned int n_faces   = hstate->n_faces;  // 38400

  InitGPUStates(HostMem, h_root, h_faces, DeviceMem, tstates, hstate->sim, 
                d_root, d_faces, n_threads, n_tetras, n_faces );
  cudaThreadSynchronize();

  ...

  kernel<<<dimGrid, dimBlock, k_smem_sz>>>(DeviceMem, tstates,
                                           d_root, d_faces);
                                         // No limit reached!
  cudaThreadSynchronize();

  ...      
}

In the InitGPUStates function the changes are as follow. Special attention to the copy of DeviceMem (I tried many forms without success). Some forms (with parenthesis, like this cudaMalloc((void **)&(*DeviceMem).n_p_left, size)) will not give me any error. I am assuming that no errors means no data copied to device. In the current form the error is code=11(cudaErrorInvalidValue) "cudaMalloc((void**)&DeviceMem->n_photons_left, size)".

// kernel.cu
int InitGPUStates(SimState* HostMem, TetrahedronStructGPU* h_root,
                  TriangleFacesGPU* h_faces,
                  SimState* DeviceMem, GPUThreadStates *tstates,
                  SimulationStruct* sim, 
                  TetrahedronStructGPU* d_root, TriangleFacesGPU* d_faces,
                  int n_threads, int n_tetras, int n_faces)
{
  unsigned int size;

  // Allocate and copy RootTetrahedron (d_root) on device
  size = n_tetras * sizeof(TetrahedronStructGPU); // Too big
  checkCudaErrors(cudaMalloc((void**)&d_root, size));
  checkCudaErrors(cudaMemcpy(d_root, h_root, size, cudaMemcpyHostToDevice));

  // Allocate and copy Faces (d_faces) on device
  size = n_faces * sizeof(TriangleFacesGPU); // Too big
  checkCudaErrors(cudaMalloc((void**)&d_faces, size));
  checkCudaErrors(cudaMemcpy(d_faces, h_faces, size, cudaMemcpyHostToDevice));     

  // HELP NEEDED MAINLY FROM HERE REGARDING POINTER VALUE COPY!
  checkCudaErrors( cudaMalloc((void**)&DeviceMem, sizeof(SimState) ); //Needed?

  size = sizeof(UINT32);
  checkCudaErrors( cudaMalloc(&DeviceMem->n_p_left, size) );
  checkCudaErrors( cudaMemcpy(DeviceMem->n_p_left,
                   HostMem->n_p_left, size, cudaMemcpyHostToDevice) );

  size = n_threads * sizeof(UINT32);
  checkCudaErrors( cudaMalloc(&DeviceMem->a, size) );
  checkCudaErrors( cudaMemcpy(DeviceMem->a, HostMem->a, size,
                                      cudaMemcpyHostToDevice) );
  size = n_threads * sizeof(UINT64);
  checkCudaErrors( cudaMalloc(&DeviceMem->x, size) );
  checkCudaErrors( cudaMemcpy(DeviceMem->x, HostMem->x, size,
                                      cudaMemcpyHostToDevice) );
  ...
}

I understand that I need to pass only pointers to the locations on device memory. How can I get the address in device? Is this passing of pointers correctly done?

The two new structures are:

// header.h
typedef struct {
  int idx;
  int vertices[4];
  float Nx, Ny, Nz, d;
} TriangleFacesGPU;

typedef struct {
  int idx, region;
  int vertices[4], faces[4], adjTetras[4];
  float n, mua, mus, g;
} TetrahedronStructGPU;

// other structures
typedef struct {
  BOOLEAN *is_active;
  BOOLEAN *dead;
  BOOLEAN *FstBackReflectionFlag;
  int *NextTetrahedron;
  UINT32 *NumForwardScatters;
  UINT32 *NumBackwardScatters;
  UINT32 *NumBackwardsSpecularReflections;
  UINT32 *NumBiases;
  UINT32 *p_layer;
  GFLOAT *p_x, *p_y, *p_z;
  GFLOAT *p_ux, *p_uy, *p_uz;
  GFLOAT *p_w;
  GFLOAT *Rspecular;
  GFLOAT *LocationFstBias;
  GFLOAT *OpticalPath;
  GFLOAT *MaxDepth;
  GFLOAT *MaxLikelihoodRatioIncrease;
  GFLOAT *LikelihoodRatioIncreaseFstBias;
  GFLOAT *LikelihoodRatio;
  GFLOAT *LikelihoodRatioAfterFstBias;
  GFLOAT *s, *sleft;
  TetrahedronStructGPU *tetrahedron;
  TriangleFacesGPU *faces;
} GPUThreadStates;

typedef struct {
  UINT32 *n_p_left;
  UINT64 *x;
  UINT32 *a;
  UINT64 *Rd_ra;
  UINT64 *A_rz;
  UINT64 *Tt_ra;
} SimState;

The definition of kernel is changed to:

__global__ void kernel(SimState *d_state, GPUThreadStates *tstates,
                       TetrahedronStructGPU *d_root,
                       TriangleFacesGPU *d_faces);