I am looking for AVX-256/512 code for a float4/double4 struct that overloads the basic operations (*, +, /, -, scaling by a scalar, etc.) so that code written with float4/double4 gets a quick performance boost from vector operations. OpenCL has these data types built in, but C++ code running on the Xeon Phi needs new implementations that take advantage of the 512-bit SIMD units.
1 Answer
What you are seeking is Agner Fog's Vector Class Library (VCL). I have used it mostly to replace the vector types in OpenCL.
With the VCL, float4 is Vec4f and double4 is Vec4d. As with OpenCL, you don't need to worry about AVX vs. AVX512: if you use Vec8d and compile for AVX, the VCL emulates the 512-bit vector using two 256-bit AVX registers.
The VCL has all the operations you want such as *,+,/,-,+=,-=,/=,*=, multiply and divide by scalar and many more features.
The main difference between OpenCL and the VCL is that OpenCL basically creates a CPU dispatcher for you, whereas with the VCL you have to write the CPU dispatcher yourself (the library includes example code and documentation for doing this). The VCL has optimized functions for SSE2 through AVX512, so you can target several different instruction sets. There is even a special version of the VCL for the Knights Corner Xeon Phi.
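To give an idea of what writing a CPU dispatcher involves, here is a minimal sketch. It does not use the VCL or its example code; the names `SumFn`, `sum_generic`, `sum_avx2`, `sum_avx512`, `select_sum` and `sum` are all hypothetical, and the per-instruction-set kernels are plain scalar stand-ins. It assumes GCC or Clang on x86 for the `__builtin_cpu_supports` check and falls back to the generic version elsewhere.

```cpp
#include <cstddef>

// Hypothetical kernel signature: sum of n floats.
using SumFn = float (*)(const float*, std::size_t);

// Generic fallback that runs on any CPU.
static float sum_generic(const float* p, std::size_t n) {
    float s = 0.0f;
    for (std::size_t i = 0; i < n; ++i) s += p[i];
    return s;
}

// Stand-ins for the same kernel built for other instruction sets.
// In a real project each would live in its own translation unit,
// compiled with the matching flag (e.g. -mavx2 or -mavx512f).
static float sum_avx2(const float* p, std::size_t n)   { return sum_generic(p, n); }
static float sum_avx512(const float* p, std::size_t n) { return sum_generic(p, n); }

// Pick the best available version by querying the CPU at run time.
static SumFn select_sum() {
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx512f")) return sum_avx512;
    if (__builtin_cpu_supports("avx2"))    return sum_avx2;
#endif
    return sum_generic;
}

// Public entry point: the selection happens once, on the first call.
float sum(const float* p, std::size_t n) {
    static const SumFn impl = select_sum();
    return impl(p, n);
}
```

Callers only ever see `sum(data, n)`; which version runs is decided once, on first use. This is roughly the work an OpenCL runtime does for you behind the scenes.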
Another feature from OpenCL that I miss is the syntax for permuting. In OpenCL, to reverse the order of the components of a float4 you can write v.wzyx, whereas with the VCL you would write permute4f<3,2,1,0>(v). It might be possible to create this syntax with C++, but I am not sure.
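As a rough idea of how close plain C++ can get, here is a sketch using a hypothetical standalone `float4` struct (not a VCL type, and scalar rather than SIMD). It provides a compile-time `permute<...>()` member plus a named `wzyx()` wrapper. Note that this gives `v.wzyx()` with parentheses, not the exact OpenCL `v.wzyx` syntax.

```cpp
#include <array>

// Hypothetical 4-component vector; components stored as x, y, z, w.
struct float4 {
    std::array<float, 4> v;

    // Compile-time permutation: result[k] = v[Ik].
    template <int I0, int I1, int I2, int I3>
    float4 permute() const {
        static_assert(I0 >= 0 && I0 < 4 && I1 >= 0 && I1 < 4 &&
                      I2 >= 0 && I2 < 4 && I3 >= 0 && I3 < 4,
                      "permute index out of range");
        return float4{{v[I0], v[I1], v[I2], v[I3]}};
    }

    // Named swizzle: reverses the component order.
    float4 wzyx() const { return permute<3, 2, 1, 0>(); }
};
```

With this, `a.wzyx()` reverses the components and `a.permute<0, 0, 1, 1>()` covers arbitrary patterns. Adding one named member per swizzle you actually use is tedious but keeps the call sites readable.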
Using the VCL, OpenMP, and a custom CPU dispatcher I have largely replaced OpenCL on the CPU.