Parallelism in GPU - CUDA / OpenCL

Question

I have a general questions about parallelism in CUDA or OpenCL code on GPU. I use NVIDIA GTX 470.

I read briefly in the Cuda programming guide, but did not find related answers hence asking here.

I have a top level function which calls the CUDA kernel(For same kernel I have a OpenCL version of it). This top level function itself is called 3 times in a 'for loop' from my main function, for 3 different data sets(Image data R,G,B) and the actual codelet also has processing over all the pixels in the image/frame so it has 2 'for loops'.

What I want to know is what kind of parallelism is exploited here - task level parallelism or data parallelism?

So what i want to understand is does does this CUDA and C code create multiple threads for different functionality/functions in the codelet and top level code and executes them in parallel and exploits task parallelism. If yes, who creates it as there is no threading library explicitly included in code or linked with.

OR

It creates threads/tasks for different 'for loop' iterations which are independent and thus achieving data parallelism. If it does this kind of parallelism, does it exploit this just by noting that different for loop iterations have no dependencies and hence can be scheduled in parallel?

Because I don't see any special compiler constructs/intrinsics(parallel for loops as in openMP) which tells the compiler/scheduler to schedule such for loops / functions in parallel?

Any reading material would help.

Michael Haidl Michael Haidl · Accepted Answer · 2011-12-22T14:26:04

Parallelism on GPUs is SIMT (Single Instruction Multiple Threads). For CUDA Kernels, you specify a grid of blocks where every block has N threads. The CUDA library does all the trick and the CUDA Compiler (nvcc) generates the GPU code which is executed by the GPU. The CUDA library tells the GPU driver and further more the thread scheduler on the GPU how many threads should execute the kernel ((number of blocks) x (number of threads)). In your example the top level function (or host function) executes only the kernel call which is asyncronous and returns emediatly. No threading library is needed because nvcc creates the calls to the driver.

A sample kernel call looks like this:

helloworld<<<BLOCKS, THREADS>>>(/* maybe some parameters */);

OpenCL follows the same paradigm but you compile yor kernel (if they are not precompiled) at runtime. Specify the number of threads to execute the kernel and the lib does the rest.

The best way to learn CUDA (OpenCL) is to look in the CUDA Programming Guide (OpenCL Programming Guide) and look at the samples in the GPU Computing SDK.

Parallelism in GPU - CUDA / OpenCL

3 Answers