“unknown error” while using dynamic allocation inside device function in CUDA

Question

I'm trying to implement a linked list in a CUDA application to model a growing network. To do so, I'm calling malloc inside a __device__ function, aiming to allocate memory in global memory. The code is:

void __device__ insereviz(Vizinhos **lista, Nodo *novizinho, int *Gteste)
{
   Vizinhos *vizinho;

   vizinho=(Vizinhos *)malloc(sizeof(Vizinhos));

   vizinho->viz=novizinho;

   vizinho->proxviz=*lista;

   *lista=vizinho;

   novizinho->k=novizinho->k+1;
}

After a certain number of allocated elements (around 90000) my program returns "unknown error". At first I thought it was a memory constraint, but I checked nvidia-smi and got:

+------------------------------------------------------+                       
| NVIDIA-SMI 331.38     Driver Version: 331.38         |                       
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 770     Off  | 0000:01:00.0     N/A |                  N/A |
| 41%   38C  N/A     N/A /  N/A |    159MiB /  2047MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

So it doesn't seem to be a memory problem, unless malloc is allocating in shared memory. To test this, I tried running two networks in separate blocks, and I still hit a limit on the number of structures I can allocate. But when I run two instances of the same program, each with a smaller number of structures, they both finish without error.

I also have tried cuda-memcheck and got

========= CUDA-MEMCHECK
========= Invalid __global__ write of size 8
=========     at 0x000001b0 in     /work/home/melo/proj_cuda/testalloc/cuda_testamalloc.cu:164:insereviz(neighbor**, node*, int*)
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x00000000 is out of bounds
=========     Device Frame:/work/home/melo/proj_cuda/testalloc/cuda_testamalloc.cu:142:insereno(int, int, node**, node**, int*) (insereno(int, int, node**, node**, int*) : 0x648)
=========     Device Frame:/work/home/melo/proj_cuda/testalloc/cuda_testamalloc.cu:111:fazrede(node**, int, int, int, int*) (fazrede(node**, int, int, int, int*) : 0x4b8)
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/libcuda.so.1 (cuLaunchKernel + 0x331) [0x138281]
=========     Host Frame:gpu_testamalloc5 [0x1bd48]
=========     Host Frame:gpu_testamalloc5 [0x3b213]
=========     Host Frame:gpu_testamalloc5 [0x2fe3]
=========     Host Frame:gpu_testamalloc5 [0x2e39]
=========     Host Frame:gpu_testamalloc5 [0x2e7f]
=========     Host Frame:gpu_testamalloc5 [0x2c2f]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xfd) [0x1eead]
=========     Host Frame:gpu_testamalloc5 [0x2829]

Is there some restriction on the kernel launch, or something else I'm missing? How can I check it?

Thank you,

Ricardo

Why are you not checking the value returned by malloc for validity? — talonmies

Robert Crovella · Accepted Answer · 2014-05-28T16:12:11

The most likely reason is that you are running out of space on the "device heap". It defaults to 8 MB, but you can change it.

Referring to the documentation, we see that device malloc allocates out of the device heap.

If an error occurs, a NULL pointer will be returned by malloc. It's good practice to test for this NULL pointer in device code (and in host code -- it's no different from host malloc in this respect). If you get a NULL pointer, you have run out of device heap space.
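As a rough sketch of that NULL test, here is a variant of the question's insereviz that reports failure instead of writing through a null pointer. The struct definitions, the bool return value, the dropped Gteste parameter, the testa kernel and the element count are all assumptions made for illustration; the question only shows the fields viz, proxviz and k.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical definitions: the question does not show these structs,
// only the fields viz, proxviz and k.
struct Nodo { int k; };
struct Vizinhos { Nodo *viz; Vizinhos *proxviz; };

// Returns true on success, false if device malloc returned NULL
// (that is, the device heap is exhausted).
__device__ bool insereviz(Vizinhos **lista, Nodo *novizinho)
{
    Vizinhos *vizinho = (Vizinhos *)malloc(sizeof(Vizinhos));
    if (vizinho == NULL) {
        return false;               // do not touch the list on failure
    }
    vizinho->viz = novizinho;
    vizinho->proxviz = *lista;
    *lista = vizinho;
    novizinho->k = novizinho->k + 1;
    return true;
}

// Hypothetical test kernel: one thread inserts up to n neighbors and
// records how many insertions succeeded before malloc failed.
__global__ void testa(Nodo *nodo, int n, int *inseridos)
{
    Vizinhos *lista = NULL;
    int ok = 0;
    for (int i = 0; i < n; ++i) {
        if (!insereviz(&lista, nodo)) break;
        ++ok;
    }
    *inseridos = ok;

    // Release the list so the heap is clean for later kernels.
    while (lista != NULL) {
        Vizinhos *prox = lista->proxviz;
        free(lista);
        lista = prox;
    }
}

int main()
{
    const int n = 200000;           // arbitrary count, chosen for illustration
    Nodo *d_nodo = NULL;
    int *d_ok = NULL;
    cudaMalloc(&d_nodo, sizeof(Nodo));
    cudaMemset(d_nodo, 0, sizeof(Nodo));
    cudaMalloc(&d_ok, sizeof(int));
    cudaMemset(d_ok, 0, sizeof(int));

    testa<<<1, 1>>>(d_nodo, n, d_ok);
    cudaError_t err = cudaDeviceSynchronize();

    int ok = 0;
    cudaMemcpy(&ok, d_ok, sizeof(int), cudaMemcpyDeviceToHost);
    printf("kernel status: %s, inserted %d of %d\n",
           cudaGetErrorString(err), ok, n);

    cudaFree(d_nodo);
    cudaFree(d_ok);
    return err == cudaSuccess ? 0 : 1;
}
```

With the check in place the kernel should finish cleanly even when the heap runs out, and the count copied back to the host shows how many elements fit before malloc started returning NULL.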

As indicated in the documentation, the size of the device heap can be adjusted before your kernel call by using the runtime API function:

cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size)
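A minimal host-side sketch of that call follows. The 256 MB figure is an arbitrary value picked for illustration, not a recommendation; cudaDeviceGetLimit is used only to print the heap size before and after the change. The limit has to be set before launching any kernel that calls malloc or free.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t heap = 0;

    // Query the current device heap size.
    cudaDeviceGetLimit(&heap, cudaLimitMallocHeapSize);
    printf("heap size before: %zu bytes\n", heap);

    // Assumed value for illustration: request a 256 MB device heap.
    const size_t novo = (size_t)256 * 1024 * 1024;
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, novo);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDeviceSetLimit failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }

    // Read the limit back to see what the runtime actually granted.
    cudaDeviceGetLimit(&heap, cudaLimitMallocHeapSize);
    printf("heap size after:  %zu bytes\n", heap);

    // Kernels that call device malloc would be launched from here on.
    return 0;
}
```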

If you ignore all this and attempt to use the NULL pointer returned anyway, you'll get invalid accesses in device code, like this:

=========     Address 0x00000000 is out of bounds
