A theoretical question about CUDA and GPU parallel computation.
As I understand it, a kernel is a function (code) that is executed by the GPU. Each kernel is executed by a grid, which consists of blocks, and each block contains threads. So a single kernel (one piece of code) can be executed by thousands of threads at once; see the sketch below.
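For example, here is a minimal compilable sketch (the kernel name and launch dimensions are purely illustrative) that launches a grid of 128 blocks with 256 threads each, i.e. 32768 threads all running the same kernel code:

```
#include <cstdio>

// Hypothetical kernel: every thread computes its own global index;
// thread 0 reports how many threads this launch created in total.
__global__ void myKernel()
{
    int globalIdx = blockIdx.x * blockDim.x + threadIdx.x;
    if (globalIdx == 0)
        printf("total threads: %d\n", gridDim.x * blockDim.x);
}

int main()
{
    // Grid of 128 blocks, each block with 256 threads:
    // one kernel launch -> 128 * 256 = 32768 threads.
    myKernel<<<128, 256>>>();
    cudaDeviceSynchronize();
    return 0;
}
```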
I have a question about shared memory and synchronization in kernel code. Could you justify the necessity of synchronization in kernels that use shared memory? And how does that synchronization affect processing efficiency?
__syncthreads() is frequently found in kernels that use shared memory, after the shared memory load, to prevent race conditions. Since shared memory is usually loaded cooperatively (by all threads in the block), it's necessary to make sure that all threads have completed the loading operation before any thread begins to use the loaded data for further processing. - Robert Crovella
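To illustrate the pattern described in that answer, here is a minimal sketch (the kernel and its name are hypothetical; it assumes the array length is a multiple of BLOCK_SIZE and that the kernel is launched with BLOCK_SIZE threads per block). Each thread loads one element into shared memory and then reads a *different* thread's element; without the barrier between the load and the read, a thread could read a slot that its neighbor has not written yet:

```
#define BLOCK_SIZE 256

// Reverses each block-sized segment of `in` into `out`.
// Assumes gridDim.x * BLOCK_SIZE == array length.
__global__ void reverseSegments(const int *in, int *out)
{
    __shared__ int tile[BLOCK_SIZE];

    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    tile[t] = in[i];   // cooperative load: one element per thread

    __syncthreads();   // barrier: every load must finish before any read

    out[i] = tile[BLOCK_SIZE - 1 - t];  // read an element written by ANOTHER thread
}
```

As for efficiency: __syncthreads() is a barrier for all threads in the block, so threads stall until the slowest one arrives. That adds some latency, but the hardware barrier itself is typically cheap compared to global memory traffic, and it is the price of correctness whenever threads consume data written by other threads.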