Should we use vector types if we want to write optimized code once for both CPU and GPU?

Question

As is known, for the OpenCL vector type float16:

float16 on an AMD GPU (GCN) doesn't use additional vector operations, because vector operations are already used even without vector types, via the wavefront (each thread = one SIMD lane). I.e. float16 can help only for loads/stores over a wide memory bus, for example HBM (High Bandwidth Memory): https://stackoverflow.com/a/42315728/1558037
but float16 on an AMD CPU is recommended in order to engage the CPU's SIMD lanes (because each thread = one whole CPU core, not a SIMD lane): http://developer.amd.com/tools-and-sdks/opencl-zone/opencl-resources/programming-in-opencl/image-convolution-using-opencl/image-convolution-using-opencl-a-step-by-step-tutorial-5/

As a result:

on GCN one thread sees one SIMD element, i.e. one thread is mapped to one SIMD lane (see: Is there any guarantee that all of threads in WaveFront (OpenCL) always synchronized?)
on a CPU one thread is mapped to one whole CPU core (with many SIMD blocks, each with many SIMD lanes)

I.e. vector types such as float16 do not matter much for the GPU, but are of great importance for the CPU.
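To make the two mappings above concrete, here is a minimal sketch in plain C. It is a host-side emulation, not real OpenCL API code: the names VEC_WIDTH, saxpy_scalar_item, saxpy_wide_item, run_scalar and run_wide are all made up for illustration, and the loops over gid only stand in for what the runtime would do with an NDRange. The scalar body is what a GCN wavefront would run once per lane; the wide body is what a float16-style work-item would do on a runtime that gives each work-item a whole core.

```c
#include <assert.h>
#include <stddef.h>

#define VEC_WIDTH 16  /* hypothetical stand-in for the width of float16 */

/* Scalar style: one work-item handles exactly one element.
   On GCN, many such work-items execute in lockstep as one wavefront. */
static void saxpy_scalar_item(size_t gid, float a, const float *x, float *y)
{
    y[gid] = a * x[gid] + y[gid];
}

/* float16 style: one work-item handles VEC_WIDTH consecutive elements.
   The fixed-trip-count inner loop is the part a CPU can map to SIMD lanes. */
static void saxpy_wide_item(size_t gid, float a, const float *x, float *y)
{
    size_t base = gid * VEC_WIDTH;
    for (size_t k = 0; k < VEC_WIDTH; ++k)
        y[base + k] = a * x[base + k] + y[base + k];
}

/* Emulated launch with one work-item per element.
   Returns the number of work-items that were "launched". */
static size_t run_scalar(size_t n, float a, const float *x, float *y)
{
    for (size_t gid = 0; gid < n; ++gid)
        saxpy_scalar_item(gid, a, x, y);
    return n;
}

/* Emulated launch with one work-item per VEC_WIDTH elements.
   Assumes n is a multiple of VEC_WIDTH (no tail handling in this sketch). */
static size_t run_wide(size_t n, float a, const float *x, float *y)
{
    assert(n % VEC_WIDTH == 0);
    size_t items = n / VEC_WIDTH;
    for (size_t gid = 0; gid < items; ++gid)
        saxpy_wide_item(gid, a, x, y);
    return items;
}
```

Both variants produce the same output; the only difference is how many work-items are launched and how much work each one carries, which is exactly the knob that differs between the GPU and the CPU mapping.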

Should we use vector types if we want to write optimized OpenCL code once for both architectures, CPU and GPU?

CONCLUSION:

Vector types are not much needed for a GPU or an Intel CPU, but are needed for an AMD CPU.

Did you check how many VGPRs are used when using float16 vs float, using an ISA code output from a profiler like CodeXL ? — huseyin tugrul buyukisik
@huseyin tugrul buyukisik No, I didn't. What do you mean, are there some mistakes in my statements? — Alex
No, just saying that some optimizations can be seen that way. For example, my GPU compiles to use VGPRs even when I don't use vectors. VGPRs have more memory than SGPRs on my AMD GPU — huseyin tugrul buyukisik
It's more about readability on "scalar" architectures (even if they work on SIMDs) — huseyin tugrul buyukisik
As far as I know, GPUs have deep pipelines too, so there shouldn't be a reason not to complete 1 float while issuing the other 3. Also, I've read somewhere that GCN was capable of completing 1 vector-element fp operation in 4 cycles (so it must be like 7-8 for 4 elements) on top of 1 scalar-element fp operation in the same 4 cycles, using instruction-level parallelism — huseyin tugrul buyukisik

BlueWanderer · Accepted Answer · 2017-02-20T10:27:59

In general, if performance is what you're concerned about, it is almost always a bad idea to use the same kernel for different architectures. Pre-GCN GPUs want vectors, GCN GPUs want scalars, and CPUs can handle both with Intel's driver, but only if you are aware of it; I don't know how AMD's driver does on a CPU. CPUs also need wider vectors than GPUs. CPUs rely on cache, while GPUs rely more on scratch memory. GPUs have insanely more registers than CPUs can even dream of...

On GCN, vector types actually just make my code look nicer to me, and save some time on typing and on making mistakes. float v[4], float4 v, or even float v0, v1, v2, v3 doesn't make much difference most of the time.
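As a rough illustration of why the three spellings are interchangeable, here is a small plain-C sketch. Plain C has no built-in float4, so float4_t below is a hypothetical struct standing in for it, and the function names are invented for this example. The point is only that all three forms describe the same four scalar values and the same arithmetic; how a given OpenCL compiler allocates registers for them is not something this sketch can show.

```c
#include <assert.h>

/* Hypothetical plain-C stand-in for OpenCL's built-in float4 type. */
typedef struct { float x, y, z, w; } float4_t;

/* Same reduction written against each of the three spellings. */
static float sum_array(const float v[4])
{
    return v[0] + v[1] + v[2] + v[3];
}

static float sum_vec(float4_t v)
{
    return v.x + v.y + v.z + v.w;
}

static float sum_scalars(float v0, float v1, float v2, float v3)
{
    return v0 + v1 + v2 + v3;
}

/* Sanity check: the three forms agree on the same input values. */
static void check_equivalent(void)
{
    const float arr[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float4_t vec = {1.0f, 2.0f, 3.0f, 4.0f};
    assert(sum_array(arr) == sum_vec(vec));
    assert(sum_vec(vec) == sum_scalars(1.0f, 2.0f, 3.0f, 4.0f));
}
```

On a scalar-per-lane architecture each of these ends up as four per-thread values either way, which matches the observation that the choice is mostly about readability.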

And as said before, Intel's CL driver can map a thread to a SIMD element, which makes one core run 8 CL threads.
