7 votes

I have the following snippet of a kernel.

__global__ void plain(int* geneVec, float* probs, int* nComponents, float* randomNumbers, int* nGenes)
{
    int xid = threadIdx.x + (blockDim.x * blockIdx.x);

    // tmp, start and prob are defined in parts of the kernel elided here
    float* currentProbs = (float*)malloc(sizeof(float) * tmp);

         .....
         .....

    currentProbs[0] = probs[start];
    for (int k = 1; k < nComponents[0]; k++)
    {
        currentProbs[k] = currentProbs[k-1] + prob;
    }

       ...
       ...

    free(currentProbs);
}

When currentProbs is a static, fixed-size array (even of the same size) the kernel is very fast, but when it is dynamically allocated (as above) performance is awful.
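(For reference, a minimal sketch of the static version I'm comparing against. MAX_COMPONENTS is a hypothetical compile-time bound standing in for tmp, and the loop body is simplified to use probs[k] directly since the real start/prob come from elided code.)

#define MAX_COMPONENTS 32   // hypothetical compile-time bound standing in for tmp

__global__ void plain_static(int* geneVec, float* probs, int* nComponents,
                             float* randomNumbers, int* nGenes)
{
    int xid = threadIdx.x + (blockDim.x * blockIdx.x);

    // Fixed-size array: no device-heap traffic; the compiler can keep it
    // in registers or local memory.
    float currentProbs[MAX_COMPONENTS];

    currentProbs[0] = probs[0];
    for (int k = 1; k < nComponents[0] && k < MAX_COMPONENTS; k++)
        currentProbs[k] = currentProbs[k - 1] + probs[k];

    // ... rest of the kernel, no free() needed ...
}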

This question said I could do this inside a kernel: CUDA allocate memory in __device__ function

Here is a related question: Efficiency of Malloc function in CUDA

I was wondering whether any other methods have solved this, apart from the one proposed in the paper. It seems ridiculous that one cannot malloc/free inside a kernel without this sort of penalty.

Where does tmp come from in your pseudo code? – talonmies
So it is constant per kernel invocation? If so, why bother with dynamic memory allocation at all? – talonmies

1 Answer

10 votes

I think the reason introducing malloc() slows your code down is that it allocates from a heap that lives in global memory. When you use a fixed-size array, the compiler is likely to put it in the register file, which is much faster.
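(As an aside, a minimal sketch of where that memory comes from: the heap backing in-kernel malloc() is a fixed-size region of device global memory, and its size is configured from the host with the standard runtime API before any kernel that calls malloc() runs.)

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // The heap backing in-kernel malloc() lives in device global memory;
    // its size (8 MB by default) must be set before the kernel launches.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 64 * 1024 * 1024); // 64 MB

    size_t heapBytes = 0;
    cudaDeviceGetLimit(&heapBytes, cudaLimitMallocHeapSize);
    printf("device malloc heap: %zu bytes\n", heapBytes);

    // ... launch the kernel that calls malloc()/free() ...
    return 0;
}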

Having to do a malloc inside your kernel may mean that you're trying to do too much work in a single kernel. If each thread allocates a different amount of memory, then each thread executes the for loop a different number of times, and you get lots of warp divergence.

If every thread in a warp runs the loop the same number of times, just allocate up front. Even if they run it a different number of times, you can use a constant worst-case size. But better still, I think you should look at how you can refactor your code to remove that loop from your kernel entirely. A sketch of the up-front approach follows.
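This is a minimal sketch, not your actual kernel: scratch, maxComponents, blocks and threadsPerBlock are hypothetical names, and maxComponents is assumed to be a worst-case bound on nComponents[0]. The idea is one cudaMalloc from the host, with each thread indexing its own slice of the buffer.

// Kernel takes a pre-allocated scratch buffer instead of calling malloc().
__global__ void plain(int* geneVec, float* probs, int* nComponents,
                      float* randomNumbers, int* nGenes,
                      float* scratch, int maxComponents)
{
    int xid = threadIdx.x + (blockDim.x * blockIdx.x);

    // Each thread works on its own slice of the shared buffer.
    float* currentProbs = scratch + (size_t)xid * maxComponents;

    // ... same loop as before, no malloc()/free() ...
}

// Host side: one allocation sized for the worst case.
float* scratch;
cudaMalloc(&scratch, (size_t)blocks * threadsPerBlock * maxComponents * sizeof(float));
plain<<<blocks, threadsPerBlock>>>(geneVec, probs, nComponents,
                                   randomNumbers, nGenes, scratch, maxComponents);
cudaFree(scratch);

One allocation amortizes the heap overhead across the whole launch, and the per-thread slices keep neighbouring threads' accesses in the same region of global memory.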