Is there any relation between the size of the shared memory and the maximum number of threads per block?. In my case I use Max threads per block = 512, my program makes use of all the threads and it uses considerable amount of shared memory.
Each thread has to do a particular task repeatedly. For example my kernel might look like,
int threadsPerBlock = (blockDim.x * blockDim.y * blockDim.z);
int bId = (blockIdx.x * gridDim.y * gridDim.z) + (blockIdx.y * gridDim.z) + blockIdx.z;
for(j = 0; j <= N; j++) {
tId = threadIdx.x + (j * threadsPerBlock);
uniqueTid = bId*blockDim.x + tId;
curand_init(uniqueTid, 0, 0, &seedValue);
randomP = (float) curand_uniform( &seedValue );
if(randomP <= input_value)
/* Some task */
else
/* Some other task */
}
But my threads are not going into next iteration (say j = 2). Am i missing something obvious here?