1 vote

I have a CUDA card with compute capability 3.5. If I have a kernel call such as <<<2000,512>>>, what is the number of iterations that occur within the kernel? I thought it was 2000*512, but my testing isn't proving this. I also want to confirm that the way I'm calculating the variable is correct.

The situation is, within the kernel I am incrementing a number passed in global memory, based on the thread number:

  int thr = blockDim.x * blockIdx.x + threadIdx.x;
  worknumber = globalnumber + thr;

So, when I return to the CPU, I want to know exactly how many increments there were, so I can keep track and not repeat or skip numbers when I call the kernel again to process my next set of numbers.

Edit :

__global__ void allin(uint64_t *lkey, const unsigned char *d_patfile)
{
    uint64_t kkey;
    int tmp;
    int thr = blockDim.x * blockIdx.x + threadIdx.x;
    kkey = *lkey + thr;

    if (thr > tmp) {
        tmp = thr;
        printf("%u \n", thr);
    }
}
You are going to have to explain what you mean by iterations. CUDA isn't, by design, iterative in any way. – talonmies
When I say iterations, I mean: how many times is the code within the kernel being run? What are the total threads? I am not entirely sure I have my naming conventions right when I say threads. The definition of iteration: "the repetition of a sequence of computer instructions a specified number of times or until a condition is met". Isn't that kind of exactly what takes place? – John Styles
Can you expand that code snippet into your full kernel? I would guess you are seeing the effects of a memory race, but without seeing the code it is impossible to say. – talonmies
Your kernel doesn't make sense from a C/C++ perspective. You are not initializing tmp to any value before testing it in the if statement; I would think the compiler would throw a warning about that. The number of threads or "iterations" created by <<<2000,512>>> is indeed 2000*512. printf from a CUDA kernel has various limitations, so using it to validate that a large number of threads were launched probably won't work. – Robert Crovella
If you want to prove the 2000*512 number to yourself, then create a single __device__ global variable, initialize it to zero, then have each thread do atomicAdd(&var, 1);. After that, copy the variable back to host code and print it out. – Robert Crovella

1 Answer

4 votes

If you launch a kernel with the configuration <<<X,Y>>>, and you have not violated any rules of CUDA usage, then the number of threads launched will, in fact, be X*Y (or a suitable modification of that if we are talking about 2 or 3 dimensional threadblocks and/or grids, i.e. X.x*X.y*X.z*Y.x*Y.y*Y.z ).

printf from a CUDA kernel has various limitations. Therefore, generating a large amount of printf output from a CUDA kernel is generally unwise and probably not useful for validating the number of threads launched in a large grid.

If you want to keep track of the number of threads that actually get launched, you could use a global variable and have each thread atomically update it. Something like this:

$ cat t848.cu
#include <stdio.h>

__device__ unsigned long long totThr = 0;

__global__ void mykernel(){

  atomicAdd(&totThr, 1);
}

int main(){

  mykernel<<<2000,512>>>();
  unsigned long long total;
  cudaMemcpyFromSymbol(&total, totThr, sizeof(unsigned long long));
  printf("Total threads counted: %llu\n", total);
}
$ nvcc -o t848 t848.cu
$ cuda-memcheck ./t848
========= CUDA-MEMCHECK
Total threads counted: 1024000
========= ERROR SUMMARY: 0 errors
$

Note that atomic operations may be relatively slow. I wouldn't recommend making regular use of such a code for performance reasons. But if you want to convince yourself of the number of threads launched, it should give the correct answer.