I looked at how AVX2 would do this with intrinsics and noticed that the broadcast reads from memory just like with KNC. Looking at the assembly from the AVX2 intrinsics I wrote inline assembly which does the same thing.
#include <stdio.h>
#include <x86intrin.h>
void foo(int *A, int n) {
__m256i a16 = _mm256_loadu_si256((__m256i*)A);
__m256i t = _mm256_set1_epi32(n);
__m256i s16 = _mm256_srav_epi32(a16,t);
_mm256_storeu_si256((__m256i*)A, s16);
}
void foo2(int *A, int n) {
__asm__("vmovdqu (%0),%%ymm0\n"
"vpbroadcastd (%1), %%ymm1\n"
"vpsravd %%ymm1, %%ymm0, %%ymm0\n"
"vmovdqu %%ymm0, (%0)"
:
: "r" (A), "r" (&n)
: "memory"
);
}
int main(void) {
int x[8];
for(int i=0; i<8; i++) x[i] = 1<<i;
for(int i=0; i<8; i++) printf("%8d ", x[i]); puts("");
foo2(x,2);
for(int i=0; i<8; i++) printf("%8d ", x[i]); puts("");
}
Here is my guess for KNC (using aligned loads):
void foo2_KNC(int *A, int n) {
__asm__("vmovdqa32 (%0),%%zmm0\n"
"vpbroadcastd (%1), %%zmm1\n"
"vpsravd %%zmm1, %%zmm0, %%zmm0\n"
"vmovdqa32 %%zmm0, (%0)"
:
: "r" (A), "r" (&n)
: "memory"
);
}
There appears to be a more efficient way of doing this with KNC and AVX512.
Intel says in regards to AVX12 in section "2.5.3 Broadcast":
EVEX encoding provides a bit-field to encode data broadcast for some load-op instructions
and then gives the example
vmulps zmm1, zmm2, [rax] {1to16}
where
The {1to16} primitive loads one float32 (single precision) elem
ent from memory, replicates it 16 times to form a
vector of 16 32-bit floating-point elements, multiplies the
16 float32 elements with the corresponding elements in
the first source operand vector, and put each of the 16 results into the destination operand.
I have never used his syntax before but you could try
void foo2_KNC(int *A, int n) {
__asm__("vmovdqa32 (%0),%%zmm0\n\t"
"vpsravd (%1)%{1to16}, %%zmm0, %%zmm0\n\t"
"vmovdqa32 %%zmm0, (%0)\t"
:
: "r" (A), "r" (&n)
: "memory", "%zmm0"
);
}
this produces
vmovdqa32 (%rax),%zmm0
vpsravd (%rdx){1to16}, %zmm0, %zmm0
vmovdqa32 %zmm0, (%rax)
Agner Fog incidentally has a section titled "8.4 Assembly syntax for AVX-512 and Knights Corner instructions" in the documentation for objconv where he says
these two instruction sets are very similar, but have different optional instruction attributes. Instructions from these two instruction sets differ by a single bit in the prefix, even for otherwise identical instructions.
According to his documentation NASM supports the AVX-512 and KNC syntax so you could try this syntax in NASM.
_mm512_set1_epi32
. Ideally this intrinsic doesn't generate an instruction, the broadcast is done for free using the{1to16}
operand transformation. To make effective use of assembly with the Xeon Phi you need to know things like that. You shouldn't be asking about basic things like the meaning ofr32
. – Ross Ridge