Dividing jobs for threads in CUDA using Thrust

Question

I have test code that needs to update the keys of objects stored in a device_vector. How do I divide the work among specific threads?

Example of the code without the division:

__global__ void UpdateKeys(Request* vector, int size, int seed, int qt_threads){
   curandState_t state;
   curand_init(seed, threadIdx.x, 0, &state);
   int id = blockIdx.x * blockDim.x + threadIdx.x;
   if(id < size){
       vector[id].key_ = (curand(&state) % 100) / 100;
   }
}

That vector is passed as a thrust::device_vector.

Examples of what I want:

1000 keys and 2000 threads: use only 1000 threads and give one key to each.
1000 keys and 1000 threads: use them all, one key per thread.
1 key and 100 threads: use 1 thread.
500 keys and 250 threads: each thread takes care of 2 keys.
240 keys and 80 threads: each thread takes care of 3 keys.

"e.g. 500 keys to update using 1000 threads, so each thread needs to take care of 2)"? — talonmies
how do I divide ...? do you ask because of performance concerns? — m.s.
Yes. I need maximum optimization, but I can't do it without knowing how to divide the work for specific threads like the examples. — Álvaro Espíndola

Unknown Unknown · Accepted Answer · 2017-01-29T17:13:12

If you modify your basic kernel structure like this:

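__global__ void UpdateKeys(Request* vector, int size, int seed, int qt_threads){
   curandState_t state;
   curand_init(seed, threadIdx.x, 0, &state);
   // Global index of this thread: the first element it handles
   int id = blockIdx.x * blockDim.x + threadIdx.x;
   // Stride: total number of threads in the whole grid
   int gid = blockDim.x * gridDim.x;
   // Each thread handles elements id, id + gid, id + 2*gid, ...
   for(; id < size; id += gid){
       // Note: this is integer arithmetic, so the result is always 0;
       // divide by 100.0f instead if key_ is meant to hold a fraction
       vector[id].key_ = (curand(&state) % 100) / 100;
   }
}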

then any legal one-dimensional block size and number of one-dimensional blocks can process as many or as few inputs as you pass via the size parameter. This pattern is commonly called a grid-stride loop: each thread starts at its global index and advances by the total number of threads in the grid. If you launch more threads than keys, the extra threads do nothing. If you launch fewer threads than keys, some threads process multiple keys.
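
To see how the loop spreads keys over threads, here is a minimal host-side sketch in plain C++ (which nvcc also accepts). It emulates the kernel's index arithmetic on the CPU and counts how many keys each simulated thread would touch. The names keys_per_thread and blocks_for are hypothetical helpers introduced only for this illustration; they are not part of CUDA or Thrust.

```cpp
#include <cassert>
#include <vector>

// Emulates the grid-stride loop on the host. Returns, for every simulated
// thread in a 1D grid, the number of keys that thread would process.
std::vector<int> keys_per_thread(int size, int blocks, int threads_per_block) {
    assert(size >= 0 && blocks > 0 && threads_per_block > 0);
    const int stride = blocks * threads_per_block;  // blockDim.x * gridDim.x
    std::vector<int> counts(stride, 0);
    for (int block = 0; block < blocks; ++block) {
        for (int thread = 0; thread < threads_per_block; ++thread) {
            const int first = block * threads_per_block + thread;  // global index
            // Same loop shape as in the kernel: first, first + stride, ...
            for (int id = first; id < size; id += stride) {
                ++counts[first];
            }
        }
    }
    return counts;
}

// Smallest block count that gives at least one thread per key
// (rounded-up integer division).
int blocks_for(int size, int threads_per_block) {
    assert(size >= 0 && threads_per_block > 0);
    return (size + threads_per_block - 1) / threads_per_block;
}
```

For instance, keys_per_thread(240, 1, 80) gives 3 for every thread and keys_per_thread(500, 1, 250) gives 2, matching the cases in the question, while keys_per_thread(1, 1, 100) gives 1 for the first thread and 0 for the rest. When launching the real kernel on a thrust::device_vector, the raw device pointer can be obtained with thrust::raw_pointer_cast(vec.data()).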
