15
votes

Is it my imagination, or is a PNOT instruction missing from SSE and AVX? That is, an instruction which flips every bit in the vector.

If yes, is there a better way of emulating it than PXOR with a vector of all 1s? Quite annoying since I need to set up a vector of all 1s to use that approach.

4
Setting up a vector of all 1s is not particularly difficult: [v]pcmpe[typesize] %[x/y]mmN, %[x/y]mmN[, %[x/y]mmN] or thereabouts. A single instruction to set up the constant does not seem too onerous. If you have a particular aversion to xor, pandn and andnps are also available. - EOF
It's not terrible - but it's 2x as long as I'd expect for a basic operation like this. Of course, the constant could be hoisted, at the expense of a register. Anyway, just checking my assumption that I wasn't missing this somewhere. @EOF - SODIMM
I agree in general. It matters in my case. I am throughput and port constrained on the 3 vector ports. Every vector operation costs me 1/3 of a cycle (within reason). @EOF - SODIMM
There is a ANDNPD (and-not) in SSE. - Chuck Walbourn
Similarly: where's the PNEG instruction? - Joost

4 Answers

15
votes

For cases such as this it can be instructive to see what a compiler would generate.

E.g. for the following function:

#include <immintrin.h>

__m256i test(const __m256i v)
{
  return ~v;
}

both gcc and clang seem to generate much the same code:

test(long long __vector(4)):
        vpcmpeqd        ymm1, ymm1, ymm1
        vpxor   ymm0, ymm0, ymm1
        ret
6
votes

If you use Intrinsics you can use an inline function like this to have the not operation separately.

 inline __m256i _mm256_not_si256 (__m256i a){    
     //return  _mm256_xor_si256 (a, _mm256_set1_epi32(0xffffffff));
     return  _mm256_xor_si256 (a, _mm256_cmpeq_epi32(a,a));//I didn't check wich one is faster   
 }
4
votes

AVX512F vpternlogd / _mm512_ternarylogic_epi32(__m512i a, __m512i b, __m512i c, int imm8) finally provides a way to implement NOT without any extra constants, using a single instruction which can run on any vector ALU port on Skylake-avx512.

And with AVX512VL, for 128 and 256-bit vectors as well without dirtying the upper part of a ZMM. (All AVX512 CPUs except Xeon Phi have AVX512VL).

On Skylake-X, some mysterious effect limits throughput to ~0.48/clock even for 128 and 256-bit vectors when running just this instruction in an unrolled loop, even with 6 to 10 separate dependency chains, even though it can run on any of p015. Ice Lake achieves the expected 3/clock throughput. (https://www.uops.info/html-instr/VPTERNLOGD_XMM_XMM_XMM_I8.html).

The ZMM version runs 2/clock everywhere, with port 1 SIMD ALUs shut down on SKX/ICL because 512-bit uops are in flight.


vpternlogd zmm,zmm,zmm, imm8 has 3 input vectors and one output, modifying the destination in place. With the right immediate, you can still implement a copy-and-NOT into a different register, but it will have a "false" dependency on the output register (which vpxord dst, src, all-ones wouldn't).

TL:DR: probably still use xor with all-ones as part of a loop, unless you're running out of registers. vpternlog may cost an extra vmovdqa register-copy instruction if its input is needed later.

Outside of loops, vpternlogd zmm,zmm,zmm, 0xff is the compiler's best option for creating a 512b vector of all-ones in the first place, because AVX512 compare instructions compare into masks (k0-k7), so XOR with all-ones might already involve a vpternlogd, or maybe a broadcast-constant from memory, for 512-bit vectors. Or a dep-breaking ALU uop for 128 or 256-bit vpcmpeqd same,same.


For each bit position i, the output bit is imm[ (DEST[i]<<2) + (SRC1[i]<<1) + SRC2[i]], where the imm8 is treated as an 8-element bitmap.

Thus, if we want the result to depend only on SRC2 (which is the zmm/m512/m32bcst operand), we should choose a bitmap of repeating 1,0, with 1 at the even positions (selected by src2=0).

vpternlogd  zmm1,zmm1, zmm2,  01010101b  ; 0x55  ; false dep on zmm1

If you're lucky, a compiler will optimize _mm512_xor_epi32(v, _mm512_set1_epi32(-1)) to vpternlogd for you if it's profitable.

// To hand-hold a compiler into saving a vmovdqa32 if needed:
__m512i tmp = something earlier;
__m512i t2 = _mm...(tmp);
// use-case: tmp is dead, t2 and ~t2 are both needed.
__m512i t2_inv = _mm512_ternarylogic_epi32(tmp, t2, t2, 0b01010101);

If you're not sure that's a good idea, just keep it simple and use the same variable for all 3 inputs:

__m512i t2_inv = _mm512_ternarylogic_epi32(t2, t2, t2, 0b01010101);
3
votes

You can use the PANDN OpCode for that.

PANDN implements the operation

DEST = NOT(DEST) AND SRC   ; (SSEx)

or

DEST = NOT(SRC1) AND SRC2  ; (AVXx)

Combining this operation with an all-ones vector effectively results in a PNOT operation.


Some x86(SSEx) assembly code would look like this:

; XMM0 is input register
PCMPEQB   xmm1, xmm1        ; Whole xmm1 reg set to 1's
PANDN     xmm0, xmm1        ; xmm0 = NOT(xmm0) AND xmm1
; XMM0 contains NOT(XMM0)

Some x86(AVXx) assembly code would look like this:

; YMM0 is input register
VPCMPEQB  ymm1, ymm1, ymm1  ; Whole ymm1 reg set to 1's
VPANDN    ymm0, ymm0, ymm1  ; ymm0 = NOT(ymm0) AND ymm1
; YMM0 contains NOT(YMM0)

Both can (of course) easily be translated to intrinsics.