As is known, there are the WARP (in CUDA) and the WaveFront (in OpenCL): http://courses.cs.washington.edu/courses/cse471/13sp/lectures/GPUsStudents.pdf
4.1. SIMT Architecture
...
A warp executes one common instruction at a time, so full efficiency is realized when all 32 threads of a warp agree on their execution path. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken, disabling threads that are not on that path, and when all paths complete, the threads converge back to the same execution path. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.
The SIMT architecture is akin to SIMD (Single Instruction, Multiple Data) vector organizations in that a single instruction controls multiple processing elements. A key difference is that SIMD vector organizations expose the SIMD width to the software, whereas SIMT instructions specify the execution and branching behavior of a single thread.
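For example, here is a minimal CUDA sketch of such a data-dependent branch (an illustrative kernel of my own, not taken from the quoted text):

```cuda
// Data-dependent branch that diverges inside a warp: even and odd lanes take
// different paths, so the warp runs each path with the other lanes masked off,
// and the lanes reconverge after the if/else.
__global__ void diverge(int *out)
{
    int tid  = threadIdx.x;
    int lane = tid % 32;          // lane index within the warp (1D block assumed)

    if (lane % 2 == 0)
        out[tid] = 1;             // executed with the odd lanes disabled
    else
        out[tid] = 2;             // executed with the even lanes disabled
}
```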
- WaveFront in OpenCL: https://sites.google.com/site/csc8820/opencl-basics/opencl-terms-explained#TOC-Wavefront
During runtime, the first wavefront is sent to the compute unit to run, then the second wavefront is sent to the compute unit, and so on. Work items within one wavefront are executed in parallel and in lock steps. But different wavefronts are executed sequentially.
I.e., we know that:
- threads in a WARP (CUDA) are SIMT threads, which always execute the same instruction at any given moment and always stay synchronized - i.e., the threads of a WARP behave like the lanes of a SIMD unit (on a CPU);
- threads in a WaveFront (OpenCL) are threads that always execute in parallel, but not necessarily all of them perform the exact same instruction, and not necessarily all of them are synchronized.
But is there any guarantee that all of the threads in a WaveFront are always synchronized, like the threads of a WARP or the lanes of a SIMD unit?
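For context, this guarantee is exactly what classic "warp-synchronous" code relies on - for example, the barrier-free tail of a shared-memory reduction. The sketch below is mine (the name warp_reduce and the surrounding assumptions are not from any quoted source); note that on NVIDIA GPUs since Volta (independent thread scheduling) it additionally needs __syncwarp() between steps:

```cuda
// Warp-synchronous tail of a block reduction: assumes blockDim.x >= 64 and that
// the caller invokes it only for tid < 32.  There is no __syncthreads() between
// steps, precisely because the 32 threads of a warp are assumed to run in lock step.
__device__ void warp_reduce(volatile float *sdata, int tid)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid +  8];
    sdata[tid] += sdata[tid +  4];
    sdata[tid] += sdata[tid +  2];
    sdata[tid] += sdata[tid +  1];
}
```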
Conclusion:
- WaveFront threads (work-items) are always synchronized - in lock step: "wavefront executes a number of work-items in lock step relative to each other."
- A WaveFront is mapped onto a SIMD block: "all work-items in the wavefront go to both paths of flow control"
- I.e., each WaveFront thread (work-item) is mapped to a SIMD lane
Chapter 1 OpenCL Architecture and AMD Accelerated Parallel Processing
1.1 Terminology
...
Wavefronts and work-groups are two concepts relating to compute kernels that provide data-parallel granularity. A wavefront executes a number of work-items in lock step relative to each other. Sixteen work-items are executed in parallel across the vector unit, and the whole wavefront is covered over four clock cycles. It is the lowest level that flow control can affect. This means that if two work-items inside of a wavefront go divergent paths of flow control, all work-items in the wavefront go to both paths of flow control.
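As an aside, these figures also give the wavefront size on such devices: 16 work-items per cycle × 4 cycles = 64 work-items per wavefront.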
This is confirmed for both device generations in http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_OpenCL_Programming_Optimization_Guide2.pdf:
- (page-45) Chapter 2 OpenCL Performance and Optimization for GCN Devices
- (page-81) Chapter 3 OpenCL Performance and Optimization for Evergreen and Northern Islands Devices