Type punning with (float&)int works, (float const&)int converts like (float)int instead?

Question

VS2019, Release, x86.

template <int i> float get() const {
    int f = _mm_extract_ps(fmm, i);
    return (float const&)f;
}

When use return (float&)f; compiler uses

extractps m32, ...
movss xmm0, m32

.correct result

When use return (float const&)f; compiler uses

extractps eax, ...
movd xmm0, eax

.wrong result

The main idea that T& and T const& is at first T then const. Const is just some kind of agreement for programmers. You know that you can get around it. But there is NO any const in assembly code, but type float IS. And I think that for both float& and float const& it MUST be float representation (cpu register) in assembly. We can use intermediate int reg32, but the final interpretation must be float.

And at this time it looks like regression, because this worked fine before. And also using float& in this case is definitely strange, because we shouldn't case about float const& safety but temp var for float& is really questionable.

Microsoft answered:

Hi Truthfinder, thanks for the self-contained repro. As it happens, this behavior is actually correct. As my colleague @Xiang Fan [MSFT] described in an internal email:

The conversions performed by [a c-style cast] tries the following sequence: (4.1) — a const_cast (7.6.1.11), (4.2) — a static_cast (7.6.1.9), (4.3) — a static_cast followed by a const_cast, (4.4) — a reinterpret_cast (7.6.1.10), or (4.5) — a reinterpret_cast followed by a const_cast,

If a conversion can be interpreted in more than one of the ways listed above, the interpretation that appears first in the list is used.

So in your case, (const float &) is converted to static_cast, which has the effect "the initializer expression is implicitly converted to a prvalue of type “cv1 T1”. The temporary materialization conversion is applied and the reference is bound to the result."

But in the other case, (float &) is converted to reinterpret_cast because static_cast isn’t valid, which is the same as reinterpret_cast(&operand).

The actual "bug" you're observing is that one cast does: "transform the float-typed value "1.0" into the equivalent int-typed value "1"", while the other cast says "find the bit representation of 1.0 as a float, and then interpret those bits as an int".

For this reason we recommend against c-style casts.

Thanks!

MS forum link: https://developercommunity.visualstudio.com/content/problem/411552/extract-ps-intrinsics-bug.html

Any ideas?

P.S. What do I really want:

float val = _mm_extract_ps(xmm, 3);

In manual assembly I can write: extractps val, xmm0, 3 where val is float 32 memory variable. Only ONE! instruction. I want see the same result in compiler generated assembly code. No shuffles or any other excessive instructions. The most bad acceptable case is: extractps reg32, xmm0, 3; mov val, reg32.

My point about T& and T const&: The type of variable must be the SAME for both cases. But now float& will interpret m32 as float32 and float const& will interpret m32 as int32.

int main() {
    int z = 1;
    float x = (float&)z;
    float y = (float const&)z;
    printf("%f %f %i", x, y, x==y);
    return 0;
}

Out: 0.000000 1.000000 0

Is that really OK?

Best regards, Truthfinder

If 4.2 is used for the const-ref, why wouldn't 4.3 be used for the mutable-ref? — JVApen
PS: I agree with the C-style cast remark, don't use it, it only gives you bugs. Why ain't you casting to a bare float, without the reference? — JVApen
Please let me not to agree. Because it's C++ at first. C-style cast is not a good practice for C++. C-style in C++ is a crutch again. — truthfinder
In your real code, does the resulting float need to get stored to memory instead of being in an xmm reg where you can use it? You wrote a function that returns a float, instead of storing into a float*, so any asm you look at from that will obviously finish with the result in a register. Storing a float to memory is one of the few uses for extractps. And yes, maybe on some future CPU it will be only 1 uop total, and thus better than shufps + movss for a memory dst. gcc/clang will use extractps for the code in my answer with *out=... godbolt.org/z/oLgBd4 but not MSVC — Peter Cordes
And BTW, extractps reg32, xmm0, 3 / mov val, reg32 is 3 uops including a shuffle as part of extractps, far worse than shufps / movss (especially on Bulldozer-family where latency between XMM and scalar integer is high, although if you're just storing then OoO exec can hide it if you don't reload soon). IDK why you think that would be acceptable. — Peter Cordes

Peter Cordes Peter Cordes · Accepted Answer · 2019-01-31T06:57:51

There's an interesting question about C++ cast semantics (which Microsoft already briefly answered for you), but it's mixed up with your misuse of _mm_extract_ps resulting in needing a type-pun in the first place. (And only showing asm that is equivalent, omitting the int->float conversion.) If someone else wants to expand on the standard-ese in another answer, that would be great.

TL:DR: use this instead: it's zero or one shufps. No extractps, no type punning.

template <int i> float get(__m128 input) {
    __m128 tmp = input;
    if (i)     // constexpr i means this branch is compile-time-only
        tmp = _mm_shuffle_ps(tmp,tmp,i);  // shuffle it to the bottom.
    return _mm_cvtss_f32(tmp);
}

If you actually have a memory-destination use case, you should be looking at asm for a function that takes a float* output arg, not a function that needs the result in xmm0. (And yes, that is a use-case for the extractps instruction, but arguably not the _mm_extract_ps intrinsics. gcc and clang use extractps when optimizing *out = get<2>(in), although MSVC misses that and still uses shufps + movss.)

Both blocks of asm you show are simply copying the low 32 bits of xmm0 somewhere, with no conversion to int. You left out the important different, and only showed the part that just uselessly copies the float bit-pattern out of xmm0 and then back, in 2 different ways (to register or to memory). movd is a pure copy of the bits unmodified, just like the movss load.

It's the compiler's choice which to use, after you force it to use extractps at all. Going through a register and back is lower latency than store/reload, but more ALU uops.

The (float const&) attempt to type-pun does include a conversion from FP to integer, which you didn't show. As if we needed any more reason to avoid pointer/reference casting for type-punning, this really does mean something different: (float const&)f takes the integer bit-pattern (from _mm_extract_ps) as an int and converts that to float.

I put your code on the Godbolt compiler explorer to see what you left out.

float get1_with_extractps_const(__m128 fmm) {
    int f = _mm_extract_ps(fmm, 1);
    return (float const&)f;
}

;; from MSVC -O2 -Gv  (vectorcall passes __m128 in xmm0)
float get1_with_extractps_const(__m128) PROC   ; get1_with_extractps_const, COMDAT
    extractps eax, xmm0, 1   ; copy the bit-pattern to eax

    movd    xmm0, eax      ; these 2 insns are an alternative to pxor xmm0,xmm0 + cvtsi2ss xmm0,eax to avoid false deps and zero the upper elements
    cvtdq2ps xmm0, xmm0    ; packed conversion is 1 uop
    ret     0

GCC compiles it this way:

get1_with_extractps_const(float __vector(4)):    # gcc8.2 -O3 -msse4
        extractps       eax, xmm0, 1
        pxor    xmm0, xmm0            ; cvtsi2ss has an output dependency so gcc always does this
        cvtsi2ss        xmm0, eax     ; MSVC's way is probably better for float.
        ret

Apparently MSVC does define the behaviour of pointer/reference casting for type-punning. Plain ISO C++ doesn't (strict aliasing UB), and neither do other compilers. Use memcpy to type-pun, or a union (which GNU C and MSVC support in C++ as an extension). Of course in this case, type-punning the vector element you want to an integer and back is horrible.

Only for (float &)f does gcc warn about the strict-aliasing violation. And GCC / clang agree with MSVC that only this version is a type-pun, not materializing a float from an implicit conversion. C++ is weird!

float get1_with_extractps_nonconst(__m128 fmm) {
    int f = _mm_extract_ps(fmm, 1);
    return (float &)f;
}

<source>: In function 'float get_with_extractps_nonconst(__m128)':
<source>:21:21: warning: dereferencing type-punned pointer will break strict-aliasing rules [-Wstrict-aliasing]
     return (float &)f;
                     ^

gcc optimizes away the extractps altogether.

# gcc8.2 -O3 -msse4
get1_with_extractps_nonconst(float __vector(4)):
    shufps  xmm0, xmm0, 85    ; 0x55 = broadcast element 1 to all elements
    ret

Clang uses SSE3 movshdup to copy element 1 to 0. (And element 3 to 2). But MSVC doesn't, which is another reason to never use this:

float get1_with_extractps_nonconst(__m128) PROC
    extractps DWORD PTR f$[rsp], xmm0, 1     ; store
    movss   xmm0, DWORD PTR f$[rsp]          ; reload
    ret     0

Don't use `_mm_extract_ps` for this

Both of your versions are horrible because this is not what _mm_extract_ps or extractps are for. Intel SSE: Why does `_mm_extract_ps` return `int` instead of `float`?

A float in a register is the same thing as the low element of a vector. The high elements don't need to be zeroed. And if they did, you'd want to use insertps which can do xmm,xmm and zero elements according to an immediate.

Use _mm_shuffle_ps to bring the element you want to the low position of a register, and then it is a scalar float. (And you can tell a C++ compiler that with _mm_cvtss_f32). This should compile to just shufps xmm0,xmm0,2, without an extractps or any mov.

template <int i> float get() const {
    __m128 tmp = fmm;
    if (i)                               // i=0 means the element is already in place
        tmp = _mm_shuffle_ps(tmp,tmp,i);  // else shuffle it to the bottom.
    return _mm_cvtss_f32(tmp);
}

(I skipped using _MM_SHUFFLE(0,0,0,i) because that's equal to i.)

If your fmm was in memory, not a register, then hopefully compilers would optimize away the shuffle and just movss xmm0, [mem]. MSVC 19.14 does manage to do that, at least for the function-arg on the stack case. I didn't test other compilers, but clang should probably manage to optimize away the _mm_shuffle_ps; it's very good at seeing through shuffles.

Test-case proving this compiles efficiently

e.g. a test-case with a non-class-member version of your function, and a caller that inlines it for a specific i:

#include <immintrin.h>

template <int i> float get(__m128 input) {
    __m128 tmp = input;
    if (i)                  // i=0 means the element is already in place
        tmp = _mm_shuffle_ps(tmp,tmp,i);  // else shuffle it to the bottom.
    return _mm_cvtss_f32(tmp);
}

// MSVC -Gv (vectorcall) passes arg in xmm0
// With plain dumb x64 fastcall, arg is on the stack, and it *does* just MOVSS load without shuffling
float get2(__m128 in) {
    return get<2>(in);
}

From the Godbolt compiler explorer, asm output from MSVC, clang, and gcc:

;; MSVC -O2 -Gv
float get<2>(__m128) PROC               ; get<2>, COMDAT
        shufps  xmm0, xmm0, 2
        ret     0
float get<2>(__m128) ENDP               ; get<2>

;; MSVC -O2  (without Gv, so the vector comes from memory)
input$ = 8
float get<2>(__m128) PROC               ; get<2>, COMDAT
        movss   xmm0, DWORD PTR [rcx+8]
        ret     0
float get<2>(__m128) ENDP               ; get<2>

# gcc8.2 -O3 for x86-64 System V (arg in xmm0)
get2(float __vector(4)):
        shufps  xmm0, xmm0, 2   # with -msse4, we get unpckhps
        ret

# clang7.0 -O3 for x86-64 System V (arg in xmm0)
get2(float __vector(4)):
        unpckhpd        xmm0, xmm0      # xmm0 = xmm0[1,1]
        ret

clang's shuffle optimizer simplifies to unpckhpd, which is faster on some old CPUs. Unfortunately it didn't notice it could have used movhlps xmm0,xmm0, which is also fast and 1 byte shorter.

Type punning with (float&)int works, (float const&)int converts like (float)int instead?

4 Answers

TL:DR: use this instead: it's zero or one shufps. No extractps, no type punning.

Don't use _mm_extract_ps for this

Test-case proving this compiles efficiently

Don't use `_mm_extract_ps` for this