I'm looking to calculate highly parallelized trig functions (in block of like 1024), and I'd like to take advantage of at least some of the parallelism that modern architectures have.
When I compile a block
for(int i=0; i<SIZE; i++) {
arr[i]=sin((float)i/1024);
}
GCC won't vectorize it, and says
not vectorized: relevant stmt not supported: D.3068_39 = __builtin_sinf (D.3069_38);
Which makes sense to me. However, I'm wondering if there's a library to do parallel trig computations.
With just a simple taylor series up the 11th order, GCC will vectorize all the loops, and I'm getting speeds over twice as fast as a naive sin loop (with bit-exact answers, or with 9th order series, only a single bit off for the last two out of 1600 values, for a >3x speedup). I'm sure someone has encountered a problem like this before, but when I google, I find no mentions of any libraries or the like.
A. Is there something existing already?
B. If not, advice for optimizing parallel trig functions?
EDIT: I found the following library called "SLEEF": http://shibatch.sourceforge.net/ which is described in this paper and uses SIMD instructions to calculate several elementary functions. It uses SSE and AVX specific code, but I don't think it will be hard to turn it into standard C loops.