3
votes

I tried to find a KNC broadcast instruction for Xeon Phi platform. But I could not find any instruction. Instead I tried to use this AVX _mm512_set1_epi32 intrinsic in assembly. I have two questions: first is there any KNC broadcast instruction? Second, when I compiled the below code, I got the operand type mismatch for `vpbroadcastd' error.

int op = 2;
__asm__("vmovdqa32 %0,%%zmm0\n\t"
            "mov %1, %%eax\n\t"
            "vpbroadcastd %%eax, %%zmm1\n\t"
            "vpsravd %%zmm1,%%zmm0,%%zmm1\n\t"
            "vmovdqa32 %%zmm1,%0;"
            : "=m" (tt[0]): "m" (op));

which tt defined using below code and I used k1om-mpss-linux-gcc compiler for compiling this code

int * tt = (int *) aligned_malloc(16 * sizeof(int),64);
2
According the Xeon Phi Instruction Set manual VPBROADCASTD only takes a memory location as the source operand. The AVX 2 version takes either a memory location or an XMM register. Neither allows EAX as the source.Ross Ridge
@RossRidge thank you for your reply. My question is what is the right way to use broadcast or set instruction in Xeon Phi instruction?Hamid_UMB
@RossRidge: regular AVX512F does allow broadcast from a GP register. Xeon Phi doesn't have that? That would explain the problem. In that case, the solution is just to not load into eax first, since the OP is forcing the compiler to put it in memory anyway.Peter Cordes
@RossRidge: Peter suggest the _mm512_set1_epi32(int) intrinsic. I just want to know the assembly version of this instruction. I don't know what is r32? and how I can load to r32?Hamid_UMB
I think maybe you've bitten off more than you can chew here. As Intel's documentations states there is no assembly instruction that corresponds to _mm512_set1_epi32. Ideally this intrinsic doesn't generate an instruction, the broadcast is done for free using the {1to16} operand transformation. To make effective use of assembly with the Xeon Phi you need to know things like that. You shouldn't be asking about basic things like the meaning of r32.Ross Ridge

2 Answers

3
votes

An earlier version of this answer was wrong. According to An Intels PDF of the KNC insn from Sep 2012, which I hope is current/up-to-date, 512b vpsrad is only available with an immediate count. It does appear rather inconvenient when you have the count in a GP register (rather than memory).

It appears that the variable-count shift (vpsravd) is the only way to do non-immediate-count shifts on KNC, even with the same count for every element. Since it can use a broadcast load for the shift count, that's not a huge problem. KNC also appears to have a "swizzle" shuffle or broadcast from a register source (zmm1 {aaaa}), but I'm not sure what the width of that broadcast is.

This doesn't compile on a normal compiler: the {1to16} is ignored, and you get an error that "broadcast is needed for operand of such type for `vpsravd'". IDK if that's just a syntax problem, with intel-syntax instead of AT&T.

// compile with -masm=intel
// todo: something clever to use vpsrad when the shift count is a compile-time constant
void shift_KNC(int *A, int n) {

  __asm__ volatile(
    // ".intel_syntax noprefix\n"
    "vmovdqa32      zmm0, %0\n\t"
    "vpsravd        zmm0, zmm0, %1 {1to16}\n\t"
    "vmovdqa32      %0,  zmm0\n\t"
    : "+m" (*(__m512i*)A)
    : "m" (n) /* force it to memory */
    : "%zmm0"
  );
}

Still using a full "memory" clobber because we're only telling the compiler about using the first integer as an input/output memory operand, not the next 16.

If you can keep the zmm value in memory, instead of storing/reloading between tiny fragments of inline asm, that will perform much better.


According to Xeon Phi Knights Corner intrinsics with GCC, gcc doesn't support intrinsics for KNC.


I think the PDF I have is for AVX512 (KNL/Skylake-E). IDK about KNC; it may not have this. (specifically: Intel® Architecture Instruction Set Extensions Programming Reference, from Oct 2014.)

There is a GP-register source form of VPBROADCASTD, requiring only AVX512F. VPBROADCASTD zmm1 {k1}{z}, r32. The intrinsic is

__m512i _mm512_maskz_set1_epi32( __mmask16 k, int a);

There isn't one listed without the mask, but maybe try just _mm512_set1_epi32(int).

BTW, your inline assembly compiles ok with a normal compiler on godbolt. (The "binary" checkbox makes it actually assemble and then disassemble, so I'm sure the instructions were accepted.)

If you still go with inline asm, instead of intrinsics, make sure you tidy up your code: If you're going to require the compiler to put op in memory, use a broadcast-load, rather than a mov into a GP register and broadcasting from there. Even better, use a broadcast-load memory operand for vpsravd: VPSRAVD zmm1 {k1}{z}, zmm2, zmm3/m512/m32bcst. Then you never need a VPBROADCAST instruction at all. (I assume the compiler would do this with intrinsics.)

3
votes

I looked at how AVX2 would do this with intrinsics and noticed that the broadcast reads from memory just like with KNC. Looking at the assembly from the AVX2 intrinsics I wrote inline assembly which does the same thing.

#include <stdio.h>
#include <x86intrin.h>
void foo(int *A, int n) {
    __m256i a16 = _mm256_loadu_si256((__m256i*)A);
    __m256i t = _mm256_set1_epi32(n);
    __m256i s16 = _mm256_srav_epi32(a16,t);
    _mm256_storeu_si256((__m256i*)A, s16);
}

void foo2(int *A, int n) {
    __asm__("vmovdqu      (%0),%%ymm0\n"
            "vpbroadcastd (%1), %%ymm1\n"
            "vpsravd      %%ymm1, %%ymm0, %%ymm0\n"
            "vmovdqu      %%ymm0, (%0)"
            :
            : "r" (A), "r" (&n)
            : "memory"
        );
}

int main(void) {
    int x[8];
    for(int i=0; i<8; i++) x[i] = 1<<i;
    for(int i=0; i<8; i++) printf("%8d ", x[i]); puts("");
    foo2(x,2);
    for(int i=0; i<8; i++) printf("%8d ", x[i]); puts("");
}

Here is my guess for KNC (using aligned loads):

void foo2_KNC(int *A, int n) {
    __asm__("vmovdqa32      (%0),%%zmm0\n"
            "vpbroadcastd   (%1), %%zmm1\n"
            "vpsravd        %%zmm1, %%zmm0, %%zmm0\n"
            "vmovdqa32      %%zmm0, (%0)"
            :
            : "r" (A), "r" (&n)
            : "memory"
        );
}

There appears to be a more efficient way of doing this with KNC and AVX512.

Intel says in regards to AVX12 in section "2.5.3 Broadcast":

EVEX encoding provides a bit-field to encode data broadcast for some load-op instructions

and then gives the example

vmulps zmm1, zmm2, [rax] {1to16}

where

The {1to16} primitive loads one float32 (single precision) elem ent from memory, replicates it 16 times to form a vector of 16 32-bit floating-point elements, multiplies the 16 float32 elements with the corresponding elements in the first source operand vector, and put each of the 16 results into the destination operand.

I have never used his syntax before but you could try

void foo2_KNC(int *A, int n) {
__asm__("vmovdqa32      (%0),%%zmm0\n\t"
        "vpsravd        (%1)%{1to16}, %%zmm0, %%zmm0\n\t"
        "vmovdqa32      %%zmm0, (%0)\t"
        :
        : "r" (A), "r" (&n)
        : "memory", "%zmm0"
    );

}

this produces

vmovdqa32      (%rax),%zmm0
vpsravd        (%rdx){1to16}, %zmm0, %zmm0
vmovdqa32      %zmm0, (%rax)

Agner Fog incidentally has a section titled "8.4 Assembly syntax for AVX-512 and Knights Corner instructions" in the documentation for objconv where he says

these two instruction sets are very similar, but have different optional instruction attributes. Instructions from these two instruction sets differ by a single bit in the prefix, even for otherwise identical instructions.

According to his documentation NASM supports the AVX-512 and KNC syntax so you could try this syntax in NASM.