4
votes

I am using a Tesla T10 device. It has 2 CUDA devices; the maximum number of threads in a block is 512, the maximum block dimensions are (512, 512, 64), the maximum grid size is (65535, 65535, 1), and each CUDA device has 30 multiprocessors.

Now I want to know how many threads I can run in parallel. I read previous answers, but none of them cleared my doubt. From what I read, I can run (30 * 512) threads in parallel (maxNoOfMultiprocessor * maxThreadBlockSize).

But when I launched 32 blocks of 512 threads, it still worked. How is that possible? I don't understand the "maximum threads in each dimension" and "maximum grid size" parts. Please explain with an example. Thanks in advance.

2
Maybe the last two blocks that crossed the limit go into a global scheduling queue, so the first 30 blocks finish first and the last two are executed afterwards. Maybe. – huseyin tugrul buyukisik
That means we can launch any number of thread blocks, keeping in mind the maximum of 512 threads per block, so the first 30 * 512 threads execute, then the next 30 * 512, and so on? – user2182259
But you can't be sure which block is executed before which. – huseyin tugrul buyukisik

2 Answers

5
votes

For the purposes of this discussion, forget about how many multiprocessors there are. It has nothing to do with how many blocks you can launch in a kernel (i.e. the grid.)

The number of threads you can run in parallel (i.e. that can execute simultaneously) is different than the number of threads you can launch, or the number of blocks you can launch.

Normally, you do not want to launch grids that have only as many threads as the machine can run at a given time (maxNoOfMultiprocessor * maxThreadBlockSize). The machine wants many more threads than that, so it can hide latency.

Your machine is limited to 512 threads per block, but you can launch a single-dimensional grid of up to 65535 blocks. This does not mean that all those blocks/threads are running simultaneously, but the machine will process them all eventually.

4
votes

You can create many more threads than the hardware is able to handle simultaneously. NVIDIA calls this "automatic scalability". If you have a card with 30 multiprocessors, 30 blocks run in parallel, then the remaining 2 blocks run afterwards. If you then run the same program with 32 blocks on a card with only 16 multiprocessors (supposing such a card exists), 16 blocks run first, and then the other 16.