Threads hierarchy design in CUDA for my code

Question

I want to convert my existing C++ code to CUDA:

for(int x=0 ; x < 100; x++)
{
    for(int y=0 ; y < 100; y++)
    {
        for(int w=0 ; w < 100; w++)
        {
            for(int z=0 ; z < 100; z++)
            {
              ........
            }
        }
    }
}

Together, these loops compute a new int value.

If I want to use CUDA, I have to design the thread hierarchy before writing the kernel code.

So how can I design the hierarchy?

Going by the loops, I think it would need one thread per iteration:

100*100*100*100 = 100000000 threads.

Could you help me?

Thanks

My CUDA spec:

CUDA Device #0

Major revision number: 1

Minor revision number: 1

Name: GeForce G 105M

Total global memory: 536870912

Total shared memory per block: 16384

Total registers per block: 8192

Warp size: 32

Maximum memory pitch: 2147483647

Maximum threads per block: 512

Maximum dimension 1 of block: 512

Maximum dimension 2 of block: 512

Maximum dimension 3 of block: 64

Maximum dimension 1 of grid: 65535

Maximum dimension 2 of grid: 65535

Maximum dimension 3 of grid: 1

Clock rate: 1600000

Total constant memory: 65536

Texture alignment: 256

Concurrent copy and execution: No

Number of multiprocessors: 1

Kernel execution timeout: Yes

Robert Crovella · Accepted Answer · 2015-04-22T13:33:49

100000000 threads (or blocks) is not too many for a GPU.

Your GPU has compute capability 1.1, so it is limited to 65535 blocks in each of the first two grid dimensions (x and y). Since 100*100 = 10000, which is below that limit, we could launch 10000 blocks in each of the first two grid dimensions to cover your entire for-loop extent. This launches one block per for-loop iteration (one unique combination of x, y, w, and z) and assumes that you would use the threads within each block to address the needs of your for-loop calculation code:

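__global__ void mykernel(...){

  // one block per (x, y, w, z) iteration: 10000 x 10000 blocks
  int idx = blockIdx.x;  // 0..9999, encodes (w, z)
  int idy = blockIdx.y;  // 0..9999, encodes (x, y)

  int w = idx/100;
  int z = idx%100;
  int x = idy/100;
  int y = idy%100;

  int tx = threadIdx.x;  // thread index within the block, for dividing up the loop-body work

  // the body of your for-loop code goes here ...

}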

launch:

dim3 blocks(10000, 10000);
dim3 threads(...); // can use any number here up to 512 for your device
mykernel<<<blocks, threads>>>(...);
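The index arithmetic in that kernel can be checked on the host before anything runs on the GPU. The following is only a sketch in plain C++: `LoopIndex`, `decode_block` and `covers_all_iterations` are hypothetical names introduced here, and `n` stands in for the loop bound of 100. It mirrors the divide/modulo decoding above and confirms that no (x, y, w, z) combination is produced twice.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: one iteration of the original 4-deep loop nest.
struct LoopIndex { int x, y, w, z; };

// Mirrors the kernel's decoding of blockIdx.x / blockIdx.y (n is the loop bound).
LoopIndex decode_block(int block_x, int block_y, int n) {
    LoopIndex i;
    i.w = block_x / n;
    i.z = block_x % n;
    i.x = block_y / n;
    i.y = block_y % n;
    return i;
}

// Returns true when an (n*n) x (n*n) grid never decodes to the same iteration twice.
// The grid has exactly n^4 blocks, so "no duplicates" also means "nothing missed".
bool covers_all_iterations(int n) {
    assert(n > 0);
    const std::size_t un = static_cast<std::size_t>(n);
    std::vector<bool> seen(un * un * un * un, false);
    for (int by = 0; by < n * n; ++by) {
        for (int bx = 0; bx < n * n; ++bx) {
            const LoopIndex i = decode_block(bx, by, n);
            // Flatten (x, y, w, z) in the same nesting order as the original loops.
            const std::size_t flat = ((i.x * un + i.y) * un + i.w) * un + i.z;
            if (seen[flat]) return false;
            seen[flat] = true;
        }
    }
    return true;
}
```

For example, `decode_block(1234, 5678, 100)` yields w = 12, z = 34, x = 56, y = 78. Calling `covers_all_iterations` with a small bound such as 10 keeps the check quick.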

If, instead, you wanted to assign one thread to each of the inner z iterations of your for-loop (which might be useful or give higher performance, depending on what you are doing and how your data is organized), you could do something like this:

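__global__ void mykernel(...){

  // one block per (x, y, w) combination, one thread per z iteration
  int idx = blockIdx.x;  // 0..9999, encodes (w, x)
  int idy = blockIdx.y;  // 0..99, used directly as y

  int w = idx/100;
  int x = idx%100;
  int y = idy;

  int z = threadIdx.x;   // 0..99

  // the body of your for-loop code goes here ...

}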

launch:

dim3 blocks(10000, 100);
dim3 threads(100); 
mykernel<<<blocks, threads>>>(...);
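Both launch configurations can be sanity-checked against the limits listed in the question. This is a minimal sketch in plain C++: the constants are copied from that device listing and apply to that device only, and `fits_device_limits` is a hypothetical helper, not part of the CUDA API.

```cpp
#include <cassert>

// Limits copied from the device listing in the question (compute capability 1.1).
// Other GPUs report different values.
const int kMaxGridDimX = 65535;
const int kMaxGridDimY = 65535;
const int kMaxThreadsPerBlock = 512;

// Hypothetical helper: true when a 2D grid of 1D blocks fits within those limits.
bool fits_device_limits(long long grid_x, long long grid_y, long long threads_per_block) {
    assert(grid_x > 0 && grid_y > 0 && threads_per_block > 0);
    return grid_x <= kMaxGridDimX &&
           grid_y <= kMaxGridDimY &&
           threads_per_block <= kMaxThreadsPerBlock;
}
```

With these limits, `fits_device_limits(10000, 10000, 512)` and `fits_device_limits(10000, 100, 100)` both pass, while a flat grid of 100000000 blocks in x alone does not, which is why the iteration space is split across two grid dimensions. In real code the limits would normally come from `cudaGetDeviceProperties` rather than hard-coded constants.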

All of the above assumes your for-loop iterations are independent. If your for-loop iterations are dependent on each other (dependent on the order of execution) then these simplistic answers won't work, and you have not provided enough information in your question to discuss a reasonable strategy.
