6 votes

The Intel Xeon Phi "Knights Landing" processor will be the first to support AVX-512, but it will only support the foundation subset, "F" (like SSE without SSE2, or AVX without AVX2), so mainly floating-point and 32/64-bit integer operations.

I'm writing software that operates on bytes and words (8- and 16-bit) using up to SSE4.1 instructions via intrinsics.
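For concreteness, a minimal (made-up) example of the kind of kernel I mean, everything 8/16-bit, nothing wider:

    #include <smmintrin.h>  /* SSE4.1 */
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical example: clamp each 16-bit word of src to the
       corresponding word of limit, 8 words per iteration.
       _mm_min_epu16 is SSE4.1. */
    void clamp_words(uint16_t *dst, const uint16_t *src,
                     const uint16_t *limit, size_t n)
    {
        for (size_t i = 0; i + 8 <= n; i += 8) {
            __m128i s = _mm_loadu_si128((const __m128i *)(src + i));
            __m128i l = _mm_loadu_si128((const __m128i *)(limit + i));
            _mm_storeu_si128((__m128i *)(dst + i), _mm_min_epu16(s, l));
        }
    }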

I'm unsure whether there will be EVEX-encoded versions of all or most SSE4.1 instructions in AVX-512F, and whether that would mean my SSE code automatically gains EVEX-extended instructions and access to all the new registers.

Wikipedia says this:

The width of the SIMD register file is increased from 256 bits to 512 bits, with a total of 32 registers ZMM0-ZMM31. These registers can be addressed as 256 bit YMM registers from AVX extensions and 128-bit XMM registers from Streaming SIMD Extensions, and legacy AVX and SSE instructions can be extended to operate on the 16 additional registers XMM16-XMM31 and YMM16-YMM31 when using EVEX encoded form.

Unfortunately, this does not clarify whether compiling SSE4 code with AVX-512 enabled will give the same (awesome) speedup that compiling it for AVX2 does (VEX encoding of the legacy instructions).

Does anybody know what will happen when SSE2/4 code (C intrinsics) is compiled for AVX-512F? Could one expect a speed bump like the one AVX1's VEX encoding gave the byte and word instructions?

I may have answered my own question with more looking. See the last sentence of this: en.wikipedia.org/wiki/AVX-512#SIMD_modes ... It looks like SSE/AVX instructions operating on bytes and words will NOT share a namespace with the new registers until AVX-512BW. Any clarification on whether this actually means something performance-wise? – user1649948
You might want to wait for Purley (next year, supposedly) - it will have the AVX-512BW additions. – Paul R
AVX-512F will be supported by both "Big Core" (Xeon) and the "throughput HPC accelerator" (Xeon Phi). But Xeon Phi and Big Core will each also have additional, unique AVX-512 instruction sets targeted exclusively at Big Core users or at "throughput" uses. AVX-512BW is exclusive to Big Core, while e.g. AVX-512ER (reciprocals) is exclusive to Xeon Phi. I'm not sure if it's "performance-wise", but it should be "power-performance-wise" and a little bit FP-focused (since Xeon Phi targets more FP-oriented, power-sensitive, throughput-focused users). – zam
In continuation of the previous comment: it may happen that, longer term, the Big Core and Phi ISAs will see more cross-pollination of the -BW or -ER extensions (who knows), but that's not the case at the moment. – zam
Interestingly, I'm both memory-bandwidth bound and compute-bound (there are some constants that control how the algo shifts in each case). So with Phi I can go crazy on the memory, and with Big Core I can go crazy on the compute (and use less cache). Cross-pollination would indeed be good... – user1649948

1 Answer

4 votes

Okay, I think I've pieced together enough information to make a decent answer. Here goes.

What will happen when native SSE2/4 code is run on Knights Landing (KNL)?

The code will run in the bottom fourth of the registers, on a single VPU (called the compatibility layer) within a core. According to a pre-release webinar from Colfax, this means occupying only 1/4 to 1/8 of the total register space available to the core, and running in legacy mode.

What happens if the same code is recompiled with compiler flags for AVX-512F?

SSE2/4 code will be generated with the VEX prefix: pshufb becomes vpshufb and mixes freely with other AVX code in the ymm registers. Instructions will NOT be promoted to AVX-512's native EVEX encoding, and will not be able to address the new registers. Promotion to EVEX requires AVX-512VL (plus AVX-512BW for byte/word instructions), in which case the instructions gain direct access to the extra registers xmm16-xmm31 and ymm16-ymm31. It is unknown whether register sharing is possible at this point, but half-width AVX2 code (AVX-128) has demonstrated throughput similar to full 256-bit AVX2 code in many cases.
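As a minimal sketch of that re-encoding (the codegen comments reflect my expectation, not verified compiler output):

    #include <tmmintrin.h>  /* SSSE3: _mm_shuffle_epi8 */

    __m128i do_shuffle(__m128i v, __m128i idx)
    {
        /* -msse4.1:  pshufb  xmm0, xmm1        (legacy SSE encoding)
           -mavx512f: vpshufb xmm0, xmm0, xmm1  (VEX, xmm0-xmm15 only)
           Only with -mavx512bw -mavx512vl could this become an EVEX
           vpshufb with access to xmm16-xmm31. */
        return _mm_shuffle_epi8(v, idx);
    }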

Most importantly, how do I get my SSE2/4/AVX-128 byte/word code running on AVX-512F?

You'll have to load 128-bit chunks into an xmm register, sign- or zero-extend those bytes/words to 32-bit lanes in a zmm register, and operate on them as if they had always been the larger integers. When finished, convert back to bytes/words.
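A sketch of that pattern with AVX-512F intrinsics (the intrinsic names are from Intel's intrinsics guide; the kernel itself is my own illustration):

    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Illustrative only: add a constant k to each byte of src,
       16 bytes per iteration, as 16 zero-extended 32-bit lanes. */
    void add_bytes_avx512f(uint8_t *dst, const uint8_t *src,
                           size_t n, uint8_t k)
    {
        const __m512i kv = _mm512_set1_epi32(k);
        for (size_t i = 0; i + 16 <= n; i += 16) {
            __m128i b = _mm_loadu_si128((const __m128i *)(src + i));
            __m512i w = _mm512_cvtepu8_epi32(b);  /* zero-extend to dwords */
            __m512i r = _mm512_add_epi32(w, kv);  /* 32-bit arithmetic */
            __m128i o = _mm512_cvtepi32_epi8(r);  /* truncate back to bytes */
            _mm_storeu_si128((__m128i *)(dst + i), o);
        }
    }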

Is this fast?

According to material published on Larrabee (Knights Landing's prototype), type conversions of any integer width are free from xmm to zmm and vice versa, so long as registers are available. Additionally, after calculations are performed, the 32-bit results can be truncated on the fly down to byte/word length and written (packed) to unaligned memory in 128-bit chunks, potentially saving an xmm register.
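If that holds, the separate convert-and-store at the end of the sketch above could collapse into a single truncating masked store (using _mm512_mask_cvtepi32_storeu_epi8, the AVX-512F intrinsic for vpmovdb to memory):

    /* Replaces the _mm512_cvtepi32_epi8 + _mm_storeu_si128 pair above:
       truncate 16 dword lanes straight to 16 bytes of (possibly
       unaligned) memory, freeing the intermediate xmm register. */
    _mm512_mask_cvtepi32_storeu_epi8(dst + i, (__mmask16)0xFFFF, r);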

On KNL, each core has 2 VPUs that seem to be capable of talking to each other. Hence, 32-way 32-bit lookups are possible in a single vperm*2d instruction (vpermt2d/vpermi2d), at presumably reasonable throughput. This is not possible even with AVX2, which can only permute within 128-bit lanes (the lone cross-lane exception, the 32-bit vpermd, is no help for byte/word code). Combined with the free type conversions, the implicit masking of AVX-512 (sparing the costly and register-hungry blendv or explicit mask generation), and the richer set of comparators (native NOT, unsigned/signed lt/gt, etc.), rewriting SSE2/4 byte/word code for AVX-512F may provide a reasonable performance boost after all. At least on KNL.
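A rough sketch of those pieces in intrinsics (a 32-entry dword table split across two zmm registers plus a compare-into-mask; the function and threshold are hypothetical):

    #include <immintrin.h>

    /* 16 parallel lookups into a 32-entry table of 32-bit values held in
       two zmm registers (lo = entries 0-15, hi = entries 16-31), then an
       increment applied only where the looked-up value is below thresh.
       No blendv and no explicit mask vector: the predicate lives in a
       mask register. */
    __m512i lookup_and_bump(__m512i idx, __m512i lo, __m512i hi,
                            __m512i thresh)
    {
        __m512i   v = _mm512_permutex2var_epi32(lo, idx, hi); /* vpermt2d */
        __mmask16 m = _mm512_cmplt_epu32_mask(v, thresh);     /* unsigned < */
        return _mm512_mask_add_epi32(v, m, v, _mm512_set1_epi32(1));
    }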

Don't worry, I'll test the moment I get my hands on mine. ;-)