7
votes

I'm trying to vectorize some extremely performance-critical code. At a high level, each loop iteration reads six floats from non-contiguous positions in a small array, converts them to double precision, and adds them to six different double-precision accumulators. The accumulators are the same across iterations, so they can live in registers. Due to the nature of the algorithm, it's not feasible to make the memory access pattern contiguous. The array is small enough to fit in L1 cache, though, so memory latency/bandwidth isn't a bottleneck.

I'm willing to use assembly language or SSE2 intrinsics to parallelize this. I know I need to load two floats at a time into the two lower dwords of an XMM register, convert them to two doubles using cvtps2pd, then add them to two accumulators at a time using addpd.

My question is, how do I get the two floats into the two lower dwords of a single XMM register if they aren't adjacent to each other in memory? Obviously any technique that's so slow that it defeats the purpose of parallelization isn't useful. An answer in either ASM or Intel/GCC intrinsics would be appreciated.

EDIT:

  1. The size of the float array is, strictly speaking, not known at compile time, but it's almost always 256, so that case can be special-cased.

  2. The index of the float to read is determined by loading a value from a byte array. There are six byte arrays, one for each accumulator. The reads from the byte arrays are sequential, one from each array per loop iteration, so there shouldn't be many cache misses there.

  3. The access pattern of the float array is, for all practical purposes, random. A scalar sketch of the loop is shown below.
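
For reference, here's a hypothetical scalar form of the loop described above; all names (x, idx, acc, n) are placeholders, not my actual code:

/* x   - the float array, almost always 256 elements
   idx - the six byte arrays of indices, read sequentially
   acc - the six double-precision accumulators (kept in registers) */
static void accumulate(const float *x, const unsigned char *idx[6],
                       double acc[6], int n)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < 6; ++j)
            acc[j] += (double)x[idx[j][i]];
}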

2
You're probably looking for the gather instructions (there's a whole family, depending on your CPU model; Haswell should have some nice improvements there). – Leeor
@Leeor: I don't think that gather is part of the SSE2 instruction set. Isn't it AVX2? dsimcha, is AVX2 available on your machine? – Nathan Fellman
@NathanFellman, ah, sorry, I missed that restriction. – Leeor
Start by trying _mm_set_ps or _mm_set_pd and see what code your compiler generates (with optimisation enabled); it may surprise you. – Paul R
Tell us more about the layout pattern of these floats in memory. If they are regularly spaced, maybe interleaved processing is possible? Can you provide pseudo-code? – Yves Daoust

2 Answers

3
votes

For this specific case, take a look at the unpack-and-interleave instructions in your instruction reference manual. It would be something like

movss    xmm0, <addr1>   ; xmm0 = [a, 0, 0, 0]
movss    xmm1, <addr2>   ; xmm1 = [b, 0, 0, 0]
unpcklps xmm0, xmm1      ; xmm0 = [a, b, 0, 0]

Also take a look at shufps, which is handy whenever you have the data you want in the wrong order.
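
If you'd rather stay in C, here's a minimal intrinsics sketch of the same idea for one pair of accumulators, feeding the unpacked pair into the cvtps2pd/addpd steps the question already describes. The table, index, and accumulator names are placeholders:

#include <emmintrin.h>  /* SSE2 */

/* Load two non-adjacent floats, widen them to double, and add them to
   two of the six accumulators. x, i0, i1 and acc01 are placeholder names. */
static inline __m128d accumulate_pair(__m128d acc01, const float *x,
                                      unsigned i0, unsigned i1)
{
    __m128 a  = _mm_load_ss(&x[i0]);            /* [x[i0], 0, 0, 0]     */
    __m128 b  = _mm_load_ss(&x[i1]);            /* [x[i1], 0, 0, 0]     */
    __m128 ab = _mm_unpacklo_ps(a, b);          /* [x[i0], x[i1], 0, 0] */
    return _mm_add_pd(acc01, _mm_cvtps_pd(ab)); /* widen to double, add */
}

With six accumulators you'd call this three times per iteration, and the compiler should keep the three __m128d accumulators in registers.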

0
votes

I think it would be interesting to see how this performs using the lookup functions from Agner Fog's Vector Class Library. It's not a library you need to compile and link; it's just a collection of header files. If you drop the headers into your source directory, the following code should compile. It loads 16 bytes at a time from each of the six byte arrays, extends them to 32-bit integers (because the lookup function requires that), and then gathers the floats for each of the six accumulators. You could probably extend this to AVX as well. I don't know whether this will perform any better (it could be worse). My guess is that if there were a regular pattern it could help (in that case a gather function would be better), but in any case it's worth a try.

#include "vectorclass.h"    
int main() {
    const int n = 16*10;
    float x[256];
    char b[6][n];
    Vec4f sum[6];
    for(int i=0; i<6; i++) sum[i] = 0;
    for(int i=0; i<n; i+=16) {
        Vec4i in[6][4];
        for(int j=0; j<6; j++) {
            Vec16c b16 = Vec16uc().load(&b[j][i]);      
            Vec8s low,high;
            low = extend_low(b16);
            high = extend_high(b16);
            in[j][0] = extend_low(low);
            in[j][1] = extend_high(low);
            in[j][2] = extend_low(high);
            in[j][3] = extend_high(high);
        }
        for(int j=0; j<4; j++) {
            sum[0] += lookup<256>(in[0][j], x);
            sum[1] += lookup<256>(in[1][j], x);
            sum[2] += lookup<256>(in[2][j], x);
            sum[3] += lookup<256>(in[3][j], x);
            sum[4] += lookup<256>(in[4][j], x);
            sum[5] += lookup<256>(in[5][j], x);
        }
    }
}