I'm trying to vectorize some extremely performance-critical code. At a high level, each loop iteration reads six floats from non-contiguous positions in a small array, then converts these values to double precision and adds them to six different double-precision accumulators. These accumulators are the same across iterations, so they can live in registers. Due to the nature of the algorithm, it's not feasible to make the memory access pattern contiguous. The array is small enough to fit in L1 cache, though, so memory latency/bandwidth isn't a bottleneck.
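Here's a simplified scalar sketch of the loop, just to make the structure concrete (variable names and the index type are placeholders, not my actual code):

```c
#include <stddef.h>

/* Simplified scalar version of the hot loop (names are illustrative only).
   Each iteration pulls six floats from scattered positions in a small
   table and accumulates them into six separate double accumulators. */
void accumulate_scalar(const float *table,
                       const unsigned *idx0, const unsigned *idx1,
                       const unsigned *idx2, const unsigned *idx3,
                       const unsigned *idx4, const unsigned *idx5,
                       size_t n, double acc[6])
{
    double a0 = 0.0, a1 = 0.0, a2 = 0.0, a3 = 0.0, a4 = 0.0, a5 = 0.0;
    for (size_t i = 0; i < n; i++) {
        a0 += (double)table[idx0[i]];
        a1 += (double)table[idx1[i]];
        a2 += (double)table[idx2[i]];
        a3 += (double)table[idx3[i]];
        a4 += (double)table[idx4[i]];
        a5 += (double)table[idx5[i]];
    }
    acc[0] = a0; acc[1] = a1; acc[2] = a2;
    acc[3] = a3; acc[4] = a4; acc[5] = a5;
}
```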
I'm willing to use assembly language or SSE2 intrinsics to parallelize this. I know I need to load two floats at a time into the two lower dwords of an XMM register, convert them to two doubles using cvtps2pd, then add them to two accumulators at a time using addpd.
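In intrinsics terms, the part I already have figured out is roughly this (it assumes the two floats are somehow already sitting in the low two lanes of an XMM register, which is exactly the part I'm asking about):

```c
#include <emmintrin.h>  /* SSE2 */

/* Given two floats already in the low two lanes of pair_ps, widen them
   to double (cvtps2pd) and add them to a pair of double accumulators
   (addpd).  acc01 holds accumulators 0 and 1. */
static inline __m128d add_pair(__m128d acc01, __m128 pair_ps)
{
    __m128d pair_pd = _mm_cvtps_pd(pair_ps);  /* cvtps2pd: 2 floats -> 2 doubles */
    return _mm_add_pd(acc01, pair_pd);        /* addpd: accumulate both lanes    */
}
```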
My question is, how do I get the two floats into the two lower dwords of a single XMM register if they aren't adjacent to each other in memory? Obviously any technique that's so slow that it defeats the purpose of parallelization isn't useful. An answer in either ASM or Intel/GCC intrinsics would be appreciated.
EDIT:
The size of the float array is, strictly speaking, not known at compile time, but it's almost always 256, so this can be special-cased.
The element of the float array that should be read is determined by loading a value from a byte array. There are six byte arrays, one for each accumulator. The reads from the byte arrays are sequential, one from each array per loop iteration, so there shouldn't be many cache misses there.
The access pattern of the float array is for all practical purposes random.
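To pin down the layout described above (the array names and the iteration count N are illustrative):

```c
#include <stdint.h>

#define N 1024  /* number of loop iterations; placeholder value */

/* The lookup table: almost always 256 floats, small enough to stay in L1. */
static float table[256];

/* Six index streams, one per accumulator, read sequentially; each byte
   selects which element of table[] to accumulate on that iteration.
   With 256 table entries, a uint8_t index can never go out of range. */
static uint8_t idx0[N], idx1[N], idx2[N], idx3[N], idx4[N], idx5[N];
```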
Comments:
Have you looked at the gather instruction? (There's a whole family there, depending on your CPU model; Haswell should have some nice improvements there.) – Leeor
Is gather part of the SSE2 instruction set? Isn't it AVX2? dsimcha, is AVX2 available on your machine? – Nathan Fellman
Try _mm_set_ps or _mm_set_pd and see what code your compiler generates (with optimisation enabled) - it may surprise you. – Paul R
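For reference, a minimal sketch of what Paul R's _mm_set_ps suggestion would look like with the byte-index layout from the edit; names follow the earlier sketches, and whether the generated code is fast enough would of course have to be measured:

```c
#include <emmintrin.h>  /* SSE2 (also pulls in _mm_set_ps from SSE) */
#include <stddef.h>
#include <stdint.h>

/* Gather two non-adjacent floats with _mm_set_ps, widen them with
   cvtps2pd, and accumulate with addpd.  This handles accumulators 0
   and 1; the other two pairs would be done the same way. */
static void sum_pair(const float *table,
                     const uint8_t *idx0, const uint8_t *idx1,
                     size_t n, double out[2])
{
    __m128d acc01 = _mm_setzero_pd();
    for (size_t i = 0; i < n; i++) {
        /* _mm_set_ps lists lanes from high to low, so table[idx0[i]]
           lands in the lowest lane; compilers usually emit two movss
           loads plus an unpack/shuffle for this. */
        __m128 pair_ps = _mm_set_ps(0.0f, 0.0f,
                                    table[idx1[i]], table[idx0[i]]);
        acc01 = _mm_add_pd(acc01, _mm_cvtps_pd(pair_ps));
    }
    _mm_storeu_pd(out, acc01);  /* out[0] = accumulator 0, out[1] = accumulator 1 */
}
```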