The Intel® 64 and IA-32 Architectures Optimization Reference Manual §5.1 says something similar about mixing integer/FP "data types" (but curiously not singles and doubles):
When writing SIMD code that works for both integer and floating-point data, use
the subset of SIMD convert instructions or load/store instructions to ensure that
the input operands in XMM registers contain data types that are properly defined
to match the instruction.
Code sequences containing cross-typed usage produce the same result across
different implementations but incur a significant performance penalty. Using
SSE/SSE2/SSE3/SSSE3/SSE4.1 instructions to operate on type-mismatched
SIMD data in the XMM register is strongly discouraged.
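To make "cross-typed usage" concrete, here's a small sketch of my own (not from the manual) using intrinsics. Both functions negate four packed floats and produce identical bits; they differ only in whether the XOR comes from the single-precision or the integer instruction set:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Negate four packed floats.  Both variants produce identical bits; they
 * differ only in whether the XOR is a single-precision instruction (XORPS)
 * or an integer instruction (PXOR). */

static __m128 negate_ps_typed(__m128 x)
{
    const __m128 sign = _mm_set1_ps(-0.0f);   /* 0x80000000 in each lane */
    return _mm_xor_ps(x, sign);               /* XORPS: matches the data type */
}

static __m128 negate_ps_crosstyped(__m128 x)
{
    const __m128i sign = _mm_set1_epi32((int)0x80000000u);
    /* The casts only reinterpret the register; the XOR itself is PXOR,
     * an integer-domain instruction applied to floating-point data. */
    return _mm_castsi128_ps(_mm_xor_si128(_mm_castps_si128(x), sign));
}
```

The casts compile to no instructions, so the only difference in the generated code should be XORPS vs PXOR (assuming the compiler doesn't retype the operation on its own, which some optimizers do).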
The Intel® 64 and IA-32 Architectures Software Developer's Manual is similarly confusing:
SSE and SSE2 extensions define typed operations on packed and scalar floating-point data types and on 128-bit
SIMD integer data types, but IA-32 processors do not enforce this typing at the architectural level. They only
enforce it at the microarchitectural level.
...
Pentium 4 and Intel Xeon processors execute these instructions without generating an invalid-operand exception
(#UD) and will produce the expected results in register XMM0 (that is, the high and low 64-bits of each register will
be treated as a double-precision floating-point value and the processor will operate on them accordingly).
...
In this example: XORPS or PXOR can be used in place of XORPD and yield the same correct result. However,
because of the type mismatch between the operand data type and the instruction data type, a latency penalty will
be incurred due to implementations of the instructions at the microarchitecture level.
Latency penalties can also be incurred by using move instructions of the wrong type. For example, MOVAPS and
MOVAPD can both be used to move a packed single-precision operand from memory to an XMM register. However,
if MOVAPD is used, a latency penalty will be incurred when a correctly typed instruction attempts to use the data in
the register.
Note that these latency penalties are not incurred when moving data from XMM registers to memory.
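For concreteness, here is roughly what the SDM's MOVAPS/MOVAPD example looks like when written with intrinsics. This is my own sketch, not Intel's code (the function name is made up), and a compiler is of course free to pick different move instructions than the intrinsics suggest, so the generated assembly has to be checked:

```c
#include <emmintrin.h>

/* p points to 16-byte-aligned packed singles.  Both loads move the same
 * 16 bytes into an XMM register; only the declared type of the move differs. */
static __m128 sum_two_ways(const float *p)
{
    __m128 a = _mm_load_ps(p);                     /* MOVAPS: correctly typed */

    /* MOVAPD on single-precision data, then reinterpret back to singles.
     * Architecturally identical bits; the manual says the ADDPS below may
     * see extra latency on some implementations. */
    __m128 b = _mm_castpd_ps(_mm_load_pd((const double *)p));

    return _mm_add_ps(a, b);                       /* ADDPS consumes both loads */
}
```

Note that compilers routinely go the other way and emit MOVAPS/MOVUPS even for double and integer data, because the single-precision encodings are a byte shorter.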
I really have no idea what it means by "they only enforce it at the microarchitectural level" except that it suggests the different "data types" are treated differently by the μarch. I have a few guesses:
- AIUI, x86 cores typically use register renaming due to the shortage of registers. Perhaps they internally use different registers for integer/single/double operands so they can be located nearer to the respective vector units.
- It also seems possible that FP numbers are represented internally using a different format (e.g. using a bigger exponent to get rid of denorms) and converted to the canonical bits only when necessary.
- CPUs use "forwarding" or "bypassing" so that execution units don't have to wait for a result to be written back to the register file before a subsequent instruction can use it, typically saving a cycle or two. This might not happen between the integer and FP units (a rough way to test this guess is sketched below).
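If I wanted to test the bypass-delay guess, a dependency-chain microbenchmark along these lines might show it. This is only a rough sketch: it assumes GCC/Clang for `__rdtsc`, ignores turbo/frequency effects, and on some cores the XORPS/PXOR pair may show no measurable difference at all:

```c
#include <emmintrin.h>
#include <x86intrin.h>   /* __rdtsc(); GCC/Clang */
#include <stdio.h>

#define ITERS 100000000L

int main(void)
{
    const __m128  sf = _mm_set1_ps(-0.0f);
    const __m128i si = _mm_set1_epi32((int)0x80000000u);

    /* Loop-carried chain staying in the FP domain: ADDPS -> XORPS -> ... */
    __m128 x = _mm_set1_ps(1.0f);
    unsigned long long t0 = __rdtsc();
    for (long i = 0; i < ITERS; i++) {
        x = _mm_add_ps(x, x);
        x = _mm_xor_ps(x, sf);
    }
    unsigned long long t1 = __rdtsc();

    /* Same chain, but the XOR is PXOR: ADDPS -> PXOR -> ADDPS -> ...
     * If FP->integer->FP forwarding costs extra cycles, this loop is slower. */
    __m128 y = _mm_set1_ps(1.0f);
    unsigned long long t2 = __rdtsc();
    for (long i = 0; i < ITERS; i++) {
        y = _mm_add_ps(y, y);
        y = _mm_castsi128_ps(_mm_xor_si128(_mm_castps_si128(y), si));
    }
    unsigned long long t3 = __rdtsc();

    /* Store and print the results so the chains can't be optimized away.
     * (The values saturate to +/-inf after ~128 iterations, which is harmless.) */
    float out[8];
    _mm_storeu_ps(out, x);
    _mm_storeu_ps(out + 4, y);
    printf("xorps chain: %.2f cycles/iter (%g)\n", (double)(t1 - t0) / ITERS, out[0]);
    printf("pxor  chain: %.2f cycles/iter (%g)\n", (double)(t3 - t2) / ITERS, out[4]);
    return 0;
}
```

Each loop iteration is a serial dependency through the XMM register, so the measured cycles per iteration should approximate the summed latencies of the two instructions plus any bypass delay between domains.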
…movaps to load doubles, it works out anyway? It's a little weird to word it the way they did, especially since there is no conversion, but I don't see what else they could mean. – harold