I am using a Tesla T10 device. It has 2 CUDA devices, and each one reports a maximum of 512 threads per block, maximum block dimensions of (512, 512, 64), a maximum grid size of (65535, 65535, 1), and 30 multiprocessors.
Now I want to know how many threads I can run in parallel. I have read previous answers, but none of them clears up my doubt. From what I read, I can run 30 * 512 = 15360 threads in parallel (maxNoOfMultiprocessor * maxThreadBlockSize).
But when I launched 32 blocks of 512 threads (16384 threads in total), it still worked. How is that possible? I also do not understand what the maximum threads in each dimension and the maximum grid size mean. Please explain with an example. Thanks in advance.
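To put numbers on the question, here is a small host-only sketch of the arithmetic (plain C++, no CUDA calls). The constants are the limits reported above. The function names are made up for illustration, and the "multiprocessors * max threads per block" estimate is only my reading of earlier answers, so it may well be the wrong formula.

```cpp
#include <cstdio>

// Limits reported for one CUDA device in the question.
constexpr int kMultiprocessors = 30;
constexpr int kMaxThreadsPerBlock = 512;

// Estimate taken from earlier answers (an assumption, possibly a misreading):
// multiprocessors * max threads per block.
constexpr int assumedParallelThreads() {
    return kMultiprocessors * kMaxThreadsPerBlock;
}

// Total threads requested by a launch of `blocks` blocks,
// each with `threadsPerBlock` threads.
constexpr int launchedThreads(int blocks, int threadsPerBlock) {
    return blocks * threadsPerBlock;
}

// Prints the estimate, the launched total, and how far the launch exceeds the estimate.
void printComparison(int blocks, int threadsPerBlock) {
    const int estimate = assumedParallelThreads();
    const int launched = launchedThreads(blocks, threadsPerBlock);
    std::printf("estimate: %d, launched: %d, excess: %d\n",
                estimate, launched, launched - estimate);
}
```

Calling `printComparison(32, 512)` from `main` reports an estimate of 15360, a launched total of 16384, and an excess of 1024 threads, which is exactly the gap I cannot explain.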