1
votes

As known, OpenCL vector-type float16


As a result:

I.e. vector types such as float16 do not matter much for the GPU, but are of great importance for the CPU.

Should we use vector types if we want to write optimized OpenCL code once for both architectures, CPU and GPU?


CONCLUSION:

Vector types are not really needed on GPUs or Intel CPUs, but they are needed on AMD CPUs.

1
Did you check how many VGPRs are used with float16 vs. float, using the ISA output from a profiler such as CodeXL? - huseyin tugrul buyukisik
@huseyin tugrul buyukisik No, I didn't. What do you mean? Are there mistakes in my statements? - Alex
No, I'm just saying that some optimizations show up that way. For example, the compiler for my GPU uses VGPRs even when I don't use vectors. The VGPRs have more capacity than the SGPRs on my AMD GPU. - huseyin tugrul buyukisik
It's more about readability on "scalar" architectures (even if they work on SIMDs). - huseyin tugrul buyukisik
As far as I know, GPUs have deep pipelines too, so there shouldn't be a reason not to complete 1 float while issuing the other 3. Also, I've read somewhere that GCN can complete 1 vector-element FP operation in 4 cycles (so it must be about 7-8 for 4 elements), on top of 1 scalar-element FP operation in the same 4 cycles, using instruction-level parallelism. - huseyin tugrul buyukisik

1 Answer

2
votes

In general, if performance is what you're concerned about, it is almost always a bad idea to use the same kernel for different architectures. Pre-GCN GPUs want vectors, GCN GPUs want scalars, and CPUs can handle both with Intel's driver, but only if you are aware of it; I don't know how AMD's driver behaves on a CPU. CPUs also need wider vectors than GPUs. CPUs rely on cache, while GPUs rely more on scratch memory. GPUs have vastly more registers than CPUs could even dream of...

On GCN, vector types really just make my code look nicer and save me some typing and some mistakes. float v[4], float4 v, or even float v0, v1, v2, v3 makes little difference most of the time.
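To make the three spellings concrete, here is a minimal sketch of the same per-work-item computation written each way. The kernel names (`scale_vec`, `scale_arr`, `scale_scalars`), the helper `fake_gid`, and the `#ifndef __OPENCL_VERSION__` shim are assumptions made for illustration; the shim only lets the file also build as plain C for a quick host-side check. Whether the three really compile to the same ISA depends on the compiler and device, so it is worth looking at the generated code rather than assuming it.

```c
/* Host-side shim (assumption, illustration only): lets this file build as
   plain C. An OpenCL compiler defines __OPENCL_VERSION__ and skips it. */
#ifndef __OPENCL_VERSION__
#include <stddef.h>
#define __kernel
#define __global
typedef struct { float x, y, z, w; } float4;
static size_t fake_gid;   /* hypothetical: set by a host loop to mimic work-items */
static size_t get_global_id(unsigned dim) { (void)dim; return fake_gid; }
#endif

/* 1) Vector type: one float4 per work-item. */
__kernel void scale_vec(__global const float4 *in, __global float4 *out, float k)
{
    size_t i = get_global_id(0);
    float4 v = in[i];
    v.x *= k; v.y *= k; v.z *= k; v.w *= k;   /* in OpenCL C, `v *= k` also works */
    out[i] = v;
}

/* 2) Private array of four floats. */
__kernel void scale_arr(__global const float *in, __global float *out, float k)
{
    size_t i = get_global_id(0);
    float v[4];
    for (int j = 0; j < 4; ++j) v[j] = in[4 * i + j] * k;
    for (int j = 0; j < 4; ++j) out[4 * i + j] = v[j];
}

/* 3) Four separate scalars. */
__kernel void scale_scalars(__global const float *in, __global float *out, float k)
{
    size_t i = get_global_id(0);
    float v0 = in[4 * i + 0] * k;
    float v1 = in[4 * i + 1] * k;
    float v2 = in[4 * i + 2] * k;
    float v3 = in[4 * i + 3] * k;
    out[4 * i + 0] = v0;
    out[4 * i + 1] = v1;
    out[4 * i + 2] = v2;
    out[4 * i + 3] = v3;
}
```

All three read four floats, multiply them by `k`, and write four floats per work-item; only the spelling differs.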

And as said before, Intel's CL driver can map a thread to a SIMD element, which makes one core run 8 CL threads at once.
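Conceptually, that mapping looks like the sketch below. It is plain C, not the driver's actual implementation, which I have not seen. The names `saxpy_item`, `run_range`, and `LANES` are hypothetical, and the lane count of 8 assumes 8 floats per SIMD register (for example a 256-bit register). The point is only that a kernel written with scalars can still fill the SIMD unit when consecutive work-items are packed into lanes.

```c
#include <stddef.h>

#define LANES 8   /* assumption: 8 floats fit in one SIMD register */

/* Scalar kernel body: one element per work-item, no vector types. */
static void saxpy_item(size_t gid, float a,
                       const float *x, const float *y, float *out)
{
    out[gid] = a * x[gid] + y[gid];
}

/* Conceptual runtime loop: take LANES consecutive work-items at a time.
   The inner loop has a fixed trip count and no cross-lane dependencies,
   which is the shape a compiler can turn into SIMD instructions. */
static void run_range(size_t global_size, float a,
                      const float *x, const float *y, float *out)
{
    size_t base = 0;
    for (; base + LANES <= global_size; base += LANES)
        for (size_t lane = 0; lane < LANES; ++lane)
            saxpy_item(base + lane, a, x, y, out);
    for (; base < global_size; ++base)   /* leftover work-items */
        saxpy_item(base, a, x, y, out);
}
```

With this kind of packing, explicit float8 or float16 in the kernel source is not what fills the SIMD lanes; the runtime does it across work-items.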