I'm trying to vectorize some extremely performance-critical code. At a high level, each loop iteration reads six floats from non-contiguous positions in a small array, then converts these values to double precision and adds them to six different double-precision accumulators. These accumulators are the same across iterations, so they can live in registers. Due to the nature of the algorithm, it's not feasible to make the memory access pattern contiguous. The array is small enough to fit in L1 cache, though, so memory latency/bandwidth isn't a bottleneck.
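Here's a simplified scalar sketch of the loop, just to make the structure concrete (variable names and the index type are placeholders, not my actual code):

```c
#include <stddef.h>

/* Simplified scalar version of the hot loop (names are illustrative only).
   Each iteration pulls six floats from scattered positions in a small
   table and accumulates them into six separate double accumulators. */
void accumulate_scalar(const float *table,
                       const unsigned *idx0, const unsigned *idx1,
                       const unsigned *idx2, const unsigned *idx3,
                       const unsigned *idx4, const unsigned *idx5,
                       size_t n, double acc[6])
{
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0, a4 = 0.0, a5 = 0.0;
    for (size_t i = 0; i < n; i++) {
        a0 += (double)table[idx0[i]];
        a1 += (double)table[idx1[i]];
        a2 += (double)table[idx2[i]];
        a3 += (double)table[idx3[i]];
        a4 += (double)table[idx4[i]];
        a5 += (double)table[idx5[i]];
    }
    acc[0] = a0; acc[1] = a1; acc[2] = a2;
    acc[3] = a3; acc[4] = a4; acc[5] = a5;
}
```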
I'm willing to use assembly language or SSE2 intrinsics to parallelize this. I know I need to load two floats at a time into the two lower dwords of an XMM register, convert them to two doubles using cvtps2pd, then add them to two accumulators at a time using addpd.
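In intrinsics terms, the part I already have figured out is roughly this (it assumes the two floats are somehow already sitting in the low two lanes of an XMM register, which is exactly the part I'm asking about):

```c
#include <emmintrin.h>  /* SSE2 */

/* Given two floats already in the low two lanes of pair_ps, widen them
   to double (cvtps2pd) and add them to a pair of double accumulators
   (addpd).  acc01 holds accumulators 0 and 1. */
static inline __m128d add_pair(__m128d acc01, __m128 pair_ps)
{
    __m128d pair_pd = _mm_cvtps_pd(pair_ps);  /* cvtps2pd: 2 floats -> 2 doubles */
    return _mm_add_pd(acc01, pair_pd);        /* addpd: accumulate both lanes    */
}
```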
My question is, how do I get the two floats into the two lower dwords of a single XMM register if they aren't adjacent to each other in memory? Obviously any technique that's so slow that it defeats the purpose of parallelization isn't useful. An answer in either ASM or Intel/GCC intrinsics would be appreciated.
EDIT:
The size of the float array is, strictly speaking, not known at compile time, but it's almost always 256, so this can be special-cased.
The element of the float array that should be read is determined by loading a value from a byte array. There are six byte arrays, one for each accumulator. The reads from the byte arrays are sequential, one from each array per loop iteration, so there shouldn't be many cache misses there.
The access pattern of the float array is for all practical purposes random.
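To pin down the layout described above (the array names and the iteration count N are illustrative):

```c
#include <stdint.h>

#define N 1024  /* number of loop iterations; placeholder value */

/* The lookup table: almost always 256 floats, small enough to stay in L1. */
static float table[256];

/* Six index streams, one per accumulator, read sequentially; each byte
   selects which element of table[] to accumulate on that iteration.
   With 256 table entries, a uint8_t index can never go out of range. */
static uint8_t idx0[N], idx1[N], idx2[N], idx3[N], idx4[N], idx5[N];
```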
Comments:
Have you looked at the gather instruction? (There's a whole family there, depending on your CPU model; Haswell should have some nice improvements there.) – Leeor
Is gather part of the SSE2 instruction set? Isn't it AVX2? dsimcha, is AVX2 available on your machine? – Nathan Fellman
Try _mm_set_ps or _mm_set_pd and see what code your compiler generates (with optimisation enabled) - it may surprise you. – Paul R
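For reference, a minimal sketch of what Paul R's _mm_set_ps suggestion would look like with the byte-index layout from the edit; names follow the earlier sketches, and whether the generated code is fast enough would of course have to be measured:

```c
#include <emmintrin.h>  /* SSE2 (also pulls in _mm_set_ps from SSE) */
#include <stddef.h>
#include <stdint.h>

/* Gather two non-adjacent floats with _mm_set_ps, widen them with
   cvtps2pd, and accumulate with addpd.  This handles accumulators 0
   and 1; the other two pairs would be done the same way. */
static void sum_pair(const float *table,
                     const uint8_t *idx0, const uint8_t *idx1,
                     size_t n, double out[2])
{
    __m128d acc01 = _mm_setzero_pd();
    for (size_t i = 0; i < n; i++) {
        /* _mm_set_ps lists lanes from high to low, so table[idx0[i]]
           lands in the lowest lane; compilers usually emit two movss
           loads plus an unpack/shuffle for this. */
        __m128 pair_ps = _mm_set_ps(0.0f, 0.0f,
                                    table[idx1[i]], table[idx0[i]]);
        acc01 = _mm_add_pd(acc01, _mm_cvtps_pd(pair_ps));
    }
    _mm_storeu_pd(out, acc01);  /* out[0] = accumulator 0, out[1] = accumulator 1 */
}
```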