As far as I know, the OpenCL vector type float16 does not give you additional vector operations on AMD GPUs (GCN), because vector execution already happens there without vector types through the WaveFront model (each work-item is mapped to one SIMD lane). So on the GPU, float16 can help only for loads/stores over a wide memory bus, for example with HBM (High Bandwidth Memory): https://stackoverflow.com/a/42315728/1558037

However, on AMD CPUs float16 is recommended in order to engage the SIMD lanes of the CPU (because each work-item is mapped to a whole CPU core, not to a SIMD lane): http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-resources/programming-in-opencl/image-convolution-using-opencl/image-convolution-using-opencl-a-step-by-step-tutorial-5/
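To make this concrete, here is a minimal sketch in OpenCL C (the kernel names and the scaling example are mine, not taken from the linked sources) of the same operation written with scalars and with float16. On GCN the scalar version is already vectorized across the wavefront, so the float16 version mainly turns the memory accesses into wide 64-byte loads/stores; on a CPU the float16 version is also what lets a single work-item use the core's SIMD lanes:

```c
// Scalar version: one float per work-item.
// On GCN this is already SIMD, because the work-items of a wavefront
// execute this line in lock-step, one per SIMD lane.
__kernel void scale_scalar(__global const float *in,
                           __global float *out,
                           const float factor)
{
    size_t i = get_global_id(0);
    out[i] = in[i] * factor;
}

// float16 version: each work-item handles 16 floats.
// vload16/vstore16 issue 64-byte wide memory accesses (useful on a wide
// bus such as HBM); on a CPU the arithmetic maps onto the core's SIMD lanes.
// Launch with global size = N / 16.
__kernel void scale_vec16(__global const float *in,
                          __global float *out,
                          const float factor)
{
    size_t i = get_global_id(0);
    float16 v = vload16(i, in);
    vstore16(v * factor, i, out);
}
```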
As a result:

- on GCN, one work-item sees one SIMD element, i.e. one work-item is mapped to one SIMD lane (see: Is there any guarantee that all of threads in WaveFront (OpenCL) always synchronized?);
- on a CPU, one work-item is mapped to one whole CPU core (which has several SIMD blocks, each with many SIMD lanes).
That is, vector types such as float16 do not matter much for the GPU, but are of great importance for the CPU.

Should we use vector types if we want to write optimized OpenCL code once for both architectures, CPU and GPU?
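One possible approach (just a sketch of an idea, not from the linked sources; the VEC_WIDTH macro and the helper function names are hypothetical) is to keep a single kernel source parameterized by a compile-time vector width, and pick that width on the host from CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT, which CPUs typically report as 4 or 8 and GCN GPUs typically report as 1 (check clinfo for your device):

```c
#include <stdio.h>
#include <CL/cl.h>

/* Query the device's preferred float vector width and turn it into a
   build option, so one kernel source can compile as scalar code on a
   GPU and as wide vector code on a CPU that prefers it.              */
static cl_uint preferred_float_width(cl_device_id dev)
{
    cl_uint width = 1;
    clGetDeviceInfo(dev, CL_DEVICE_PREFERRED_VECTOR_WIDTH_FLOAT,
                    sizeof(width), &width, NULL);
    return width;
}

static void build_for_device(cl_program program, cl_device_id dev)
{
    char options[64];
    /* e.g. "-DVEC_WIDTH=8" on an AVX CPU, "-DVEC_WIDTH=1" on a GCN GPU */
    snprintf(options, sizeof(options), "-DVEC_WIDTH=%u",
             preferred_float_width(dev));
    clBuildProgram(program, 1, &dev, options, NULL, NULL);
}
```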
CONCLUSION:

Vector types are not really needed for the GPU or for Intel CPUs (Intel's OpenCL CPU runtime vectorizes across work-items implicitly), but they are needed for AMD CPUs.