operand type mismatch for `vpbroadcastd'

Question

I tried to find a KNC broadcast instruction for Xeon Phi platform. But I could not find any instruction. Instead I tried to use this AVX _mm512_set1_epi32 intrinsic in assembly. I have two questions: first is there any KNC broadcast instruction? Second, when I compiled the below code, I got the operand type mismatch for `vpbroadcastd' error.

int op = 2;
__asm__("vmovdqa32 %0,%%zmm0\n\t"
            "mov %1, %%eax\n\t"
            "vpbroadcastd %%eax, %%zmm1\n\t"
            "vpsravd %%zmm1,%%zmm0,%%zmm1\n\t"
            "vmovdqa32 %%zmm1,%0;"
            : "=m" (tt[0]): "m" (op));

which tt defined using below code and I used k1om-mpss-linux-gcc compiler for compiling this code

int * tt = (int *) aligned_malloc(16 * sizeof(int),64);

According the Xeon Phi Instruction Set manual VPBROADCASTD only takes a memory location as the source operand. The AVX 2 version takes either a memory location or an XMM register. Neither allows EAX as the source. — Ross Ridge
@RossRidge thank you for your reply. My question is what is the right way to use broadcast or set instruction in Xeon Phi instruction? — Hamid_UMB
@RossRidge: regular AVX512F does allow broadcast from a GP register. Xeon Phi doesn't have that? That would explain the problem. In that case, the solution is just to not load into eax first, since the OP is forcing the compiler to put it in memory anyway. — Peter Cordes
@RossRidge: Peter suggest the _mm512_set1_epi32(int) intrinsic. I just want to know the assembly version of this instruction. I don't know what is r32? and how I can load to r32? — Hamid_UMB
I think maybe you've bitten off more than you can chew here. As Intel's documentations states there is no assembly instruction that corresponds to _mm512_set1_epi32. Ideally this intrinsic doesn't generate an instruction, the broadcast is done for free using the {1to16} operand transformation. To make effective use of assembly with the Xeon Phi you need to know things like that. You shouldn't be asking about basic things like the meaning of r32. — Ross Ridge

Peter Cordes Peter Cordes · Accepted Answer · 2015-12-20T00:17:58

An earlier version of this answer was wrong. According to An Intels PDF of the KNC insn from Sep 2012, which I hope is current/up-to-date, 512b vpsrad is only available with an immediate count. It does appear rather inconvenient when you have the count in a GP register (rather than memory).

It appears that the variable-count shift (vpsravd) is the only way to do non-immediate-count shifts on KNC, even with the same count for every element. Since it can use a broadcast load for the shift count, that's not a huge problem. KNC also appears to have a "swizzle" shuffle or broadcast from a register source (zmm1 {aaaa}), but I'm not sure what the width of that broadcast is.

This doesn't compile on a normal compiler: the {1to16} is ignored, and you get an error that "broadcast is needed for operand of such type for `vpsravd'". IDK if that's just a syntax problem, with intel-syntax instead of AT&T.

// compile with -masm=intel
// todo: something clever to use vpsrad when the shift count is a compile-time constant
void shift_KNC(int *A, int n) {

  __asm__ volatile(
    // ".intel_syntax noprefix\n"
    "vmovdqa32      zmm0, %0\n\t"
    "vpsravd        zmm0, zmm0, %1 {1to16}\n\t"
    "vmovdqa32      %0,  zmm0\n\t"
    : "+m" (*(__m512i*)A)
    : "m" (n) /* force it to memory */
    : "%zmm0"
  );
}

Still using a full "memory" clobber because we're only telling the compiler about using the first integer as an input/output memory operand, not the next 16.

If you can keep the zmm value in memory, instead of storing/reloading between tiny fragments of inline asm, that will perform much better.

According to Xeon Phi Knights Corner intrinsics with GCC, gcc doesn't support intrinsics for KNC.

I think the PDF I have is for AVX512 (KNL/Skylake-E). IDK about KNC; it may not have this. (specifically: Intel® Architecture Instruction Set Extensions Programming Reference, from Oct 2014.)

There is a GP-register source form of VPBROADCASTD, requiring only AVX512F. VPBROADCASTD zmm1 {k1}{z}, r32. The intrinsic is

__m512i _mm512_maskz_set1_epi32( __mmask16 k, int a);

There isn't one listed without the mask, but maybe try just _mm512_set1_epi32(int).

BTW, your inline assembly compiles ok with a normal compiler on godbolt. (The "binary" checkbox makes it actually assemble and then disassemble, so I'm sure the instructions were accepted.)

If you still go with inline asm, instead of intrinsics, make sure you tidy up your code: If you're going to require the compiler to put op in memory, use a broadcast-load, rather than a mov into a GP register and broadcasting from there. Even better, use a broadcast-load memory operand for vpsravd: VPSRAVD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst. Then you never need a VPBROADCAST instruction at all. (I assume the compiler would do this with intrinsics.)

operand type mismatch for `vpbroadcastd'

2 Answers