2 votes

I am passing the address of an element of a lookup table as a memory operand to an extended inline assembly statement, but GCC produces an extra lea instruction when it isn't necessary, even with -Ofast -fomit-frame-pointer or -Os -f.... GCC is using RIP-relative addressing.

I was writing a function that converts two consecutive bits into a two-part XMM mask (1 quadword mask per bit). To do this, I use _mm_cvtepi8_epi64 (internally vpmovsxbq) with a memory operand from an 8-byte table, with the bits as the index.

When I use the intrinsic, GCC produces exactly the same code as with the extended inline assembly.

I could embed the memory operand directly into the ASM template, but that would always force RIP-relative addressing (and I don't like forcing myself into workarounds).

#include <stdint.h>
#include <assert.h>
#include <immintrin.h>

typedef uint64_t xmm2q __attribute__ ((vector_size (16)));

// Used for converting 2 consecutive bits (as index) into a 2-elem XMM mask (pmovsxbq)
static const uint16_t MASK_TABLE[4] = { 0x0000, 0x0080, 0x8000, 0x8080 };

xmm2q mask2b(uint64_t mask) {
    assert(mask < 4);
    #ifdef USE_ASM
        xmm2q result;
        asm("vpmovsxbq %1, %0" : "=x" (result) : "m" (MASK_TABLE[mask]));
        return result;
    #else
        // bad cast (UB?), but input should be `uint16_t*` anyways
        return (xmm2q) _mm_cvtepi8_epi64(*((__m128i*) &MASK_TABLE[mask]));
    #endif
}

Output assembly with -S (with USE_ASM and without):

__Z6mask2by:                            ## @_Z6mask2by
        .cfi_startproc
## %bb.0:
        leaq    __ZL10MASK_TABLE(%rip), %rax
        vpmovsxbq       (%rax,%rdi,2), %xmm0
        retq
        .cfi_endproc

What I was expecting (I've removed all the extra stuff):

__Z6mask2by:
        vpmovsxbq __ZL10MASK_TABLE(%rip,%rdi,2), %xmm0
        retq
What GCC version do you use? Check this. - Victor Gubin
@VictorGubin: Godbolt GCC is configured with non-PIE as the default. The OP is clearly using a GCC configured with --enable-default-pie. See 32-bit absolute addresses no longer allowed in x86-64 Linux? - Peter Cordes
In any case, I'm on a Macbook Pro with GCC version Apple LLVM version 10.0.1 (clang-1001.0.46.4). Oh crap, it's just Clang in disguise. - Arav K.

1 Answer

1 vote

The only RIP-relative addressing mode is RIP + rel32. RIP + reg is not available.

(In machine code, 32-bit code used to have 2 redundant ways to encode [disp32]. x86-64 uses the shorter (no SIB) form as RIP-relative, and the longer SIB form as [sign_extended_disp32].)


If you compile for Linux with -fno-pie -no-pie, GCC will be able to access static data with a 32-bit absolute address, so it can use a mode like __ZL10MASK_TABLE(,%rdi,2). This isn't possible for MacOS, where the base address is always above 2^32; 32-bit absolute addressing is completely unsupported on x86-64 MacOS.

In a PIE executable (or PIC code in general like a library), you need a RIP-relative LEA to set up for indexing a static array. Or any other case where the static address won't fit in 32 bits and/or isn't a link-time constant.


Intrinsics

Yes, intrinsics make it very inconvenient to express a pmovzx/sx load from a narrow source because pointer-source versions of the intrinsics are missing.

*((__m128i*) &MASK_TABLE[mask]) isn't safe: if you disable optimization, you might well get a movdqa 16-byte load, but the address will be misaligned (and the load reads past the end of the 8-byte table). It's only safe when the compiler folds the load into a memory operand for pmovzxbq, which has a 2-byte memory operand and therefore doesn't require alignment.

In fact current GCC does compile your code with a movdqa 16-byte load like movdqa xmm0, XMMWORD PTR [rax+rdi*2] before a reg-reg pmovzx. This is obviously a missed optimization. :( clang/LLVM (which MacOS installs as gcc) does fold the load into pmovzx.

The safe way is _mm_cvtepi8_epi64( _mm_cvtsi32_si128(MASK_TABLE[mask]) ) or something, and then hoping the compiler optimizes away the zero-extension from 2 to 4 bytes and folds the movd into a load when you enable optimization. Or maybe try _mm_loadu_si32 for a 32-bit load, even though you really want 16 bits. But last time I tried, compilers sucked at folding a 64-bit load intrinsic into a memory operand for pmovzxbw, for example. GCC and clang still fail at it, but ICC19 succeeds: https://godbolt.org/z/IdgoKV
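A sketch of that safe version (the function name mask2b_safe is mine, and I'm using a target attribute so it compiles without -msse4.1):

```cpp
#include <stdint.h>
#include <assert.h>
#include <immintrin.h>

static const uint16_t MASK_TABLE[4] = { 0x0000, 0x0080, 0x8000, 0x8080 };

// Zero-extend the 2-byte table entry into the low dword of a vector register,
// then sign-extend its bytes to quadwords. Only 2 bytes are ever read from the
// table, so there is no over-read and no alignment assumption.
__attribute__((target("sse4.1")))
static __m128i mask2b_safe(uint64_t mask) {
    return _mm_cvtepi8_epi64(_mm_cvtsi32_si128(MASK_TABLE[mask]));
}
```

With optimization enabled you're hoping this compiles to a single vpmovsxbq with a memory operand; without it, you at least get correct (if clunky) movd + pmovsxbq code.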

I've written about this before:


Your integer -> vector strategy

Your choice of pmovsx seems odd. You don't need sign-extension, so I would have picked pmovzx (_mm_cvtepu8_epi64). It's not actually more efficient on any CPUs, though.

A lookup table does work here, with only a small amount of static data needed. If your mask range was any bigger, you'd maybe want to look at is there an inverse instruction to the movemask instruction in intel avx2? for alternative strategies like broadcast + AND + (shift or compare).

If you do this often, using a whole cache line of 4x 16-byte vector constants might be best, so you don't need a pmovzx instruction at all: just index into an aligned table of xmm2q or __m128i vectors, which can be a memory source for any other SSE instruction. Use alignas(64) to get all the constants in the same cache line.
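A sketch of that full-vector table (MASK_LUT and mask2b_lut are names I'm inventing, and I'm assuming you want all-ones / all-zero quadword masks rather than just the sign bit set like the pmovsx version produces):

```cpp
#include <stdint.h>
#include <assert.h>
#include <immintrin.h>

// One 64-byte cache line holding all four 16-byte masks; the lookup is then a
// single aligned load. alignas(64) keeps the whole table in one cache line
// and makes each 16-byte row 16-byte aligned.
alignas(64) static const uint64_t MASK_LUT[4][2] = {
    {     0,     0 },
    { ~0ULL,     0 },
    {     0, ~0ULL },
    { ~0ULL, ~0ULL },
};

static __m128i mask2b_lut(uint64_t mask) {
    return _mm_load_si128((const __m128i*)MASK_LUT[mask]);
}
```

The result can also feed directly into pand / pblendvb / whatever as a memory operand, without ever occupying a register.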

You could also consider (intrinsics for) pdep + movd xmm0, eax + pmovzxbq reg-reg if you're targeting Intel CPUs with BMI2. (pdep is slow on AMD, though).
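That pdep route could look something like this (mask2b_pdep is a hypothetical name; needs BMI2 and SSE4.1 at run time):

```cpp
#include <stdint.h>
#include <assert.h>
#include <immintrin.h>

// pdep deposits mask bit 0 into bit 7 and mask bit 1 into bit 15 (the set
// bits of the 0x8080 pattern), recreating the MASK_TABLE entry on the fly;
// then movd + pmovsxbq widens the bytes to quadwords, all without touching
// memory. pdep is slow (microcoded) on AMD before Zen 3.
__attribute__((target("bmi2,sse4.1")))
static __m128i mask2b_pdep(uint64_t mask) {
    unsigned expanded = _pdep_u32((unsigned)mask, 0x8080u);
    return _mm_cvtepi8_epi64(_mm_cvtsi32_si128((int)expanded));
}
```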