The Intel® 64 and IA-32 Architectures Optimization Reference Manual §5.1 says something similar about mixing integer/FP "data types" (but curiously not singles and doubles):
When writing SIMD code that works for both integer and floating-point data, use
the subset of SIMD convert instructions or load/store instructions to ensure that
the input operands in XMM registers contain data types that are properly defined
to match the instruction.
Code sequences containing cross-typed usage produce the same result across
different implementations but incur a significant performance penalty. Using
SSE/SSE2/SSE3/SSSE3/SSE4.1 instructions to operate on type-mismatched
SIMD data in the XMM register is strongly discouraged.
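To make "cross-typed usage" concrete, here's a small sketch of my own (not from the manual) using intrinsics. Both functions negate four packed floats and produce identical bits; they differ only in whether the XOR comes from the single-precision or the integer instruction set:

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Negate four packed floats.  Both variants produce identical bits; they
 * differ only in whether the XOR is a single-precision instruction (XORPS)
 * or an integer instruction (PXOR). */

static __m128 negate_ps_typed(__m128 x)
{
    const __m128 sign = _mm_set1_ps(-0.0f);   /* 0x80000000 in each lane */
    return _mm_xor_ps(x, sign);               /* XORPS: matches the data type */
}

static __m128 negate_ps_crosstyped(__m128 x)
{
    const __m128i sign = _mm_set1_epi32((int)0x80000000u);
    /* The casts only reinterpret the register; the XOR itself is PXOR,
     * an integer-domain instruction applied to floating-point data. */
    return _mm_castsi128_ps(_mm_xor_si128(_mm_castps_si128(x), sign));
}
```

The casts compile to no instructions, so the only difference in the generated code should be XORPS vs PXOR (assuming the compiler doesn't retype the operation on its own, which some optimizers do).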
The Intel® 64 and IA-32 Architectures Software Developer's Manual is similarly confusing:
SSE and SSE2 extensions define typed operations on packed and scalar floating-point data types and on 128-bit
SIMD integer data types, but IA-32 processors do not enforce this typing at the architectural level. They only
enforce it at the microarchitectural level.
...
Pentium 4 and Intel Xeon processors execute these instructions without generating an invalid-operand exception
(#UD) and will produce the expected results in register XMM0 (that is, the high and low 64-bits of each register will
be treated as a double-precision floating-point value and the processor will operate on them accordingly).
...
In this example: XORPS or PXOR can be used in place of XORPD and yield the same correct result. However,
because of the type mismatch between the operand data type and the instruction data type, a latency penalty will
be incurred due to implementations of the instructions at the microarchitecture level.
Latency penalties can also be incurred by using move instructions of the wrong type. For example, MOVAPS and
MOVAPD can both be used to move a packed single-precision operand from memory to an XMM register. However,
if MOVAPD is used, a latency penalty will be incurred when a correctly typed instruction attempts to use the data in
the register.
Note that these latency penalties are not incurred when moving data from XMM registers to memory.
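For concreteness, here is roughly what the SDM's MOVAPS/MOVAPD example looks like when written with intrinsics. This is my own sketch, not Intel's code (the function name is made up), and a compiler is of course free to pick different move instructions than the intrinsics suggest, so the generated assembly has to be checked:

```c
#include <emmintrin.h>

/* p points to 16-byte-aligned packed singles.  Both loads move the same
 * 16 bytes into an XMM register; only the declared type of the move differs. */
static __m128 sum_two_ways(const float *p)
{
    __m128 a = _mm_load_ps(p);                     /* MOVAPS: correctly typed */

    /* MOVAPD on single-precision data, then reinterpret back to singles.
     * Architecturally identical bits; the manual says the ADDPS below may
     * see extra latency on some implementations. */
    __m128 b = _mm_castpd_ps(_mm_load_pd((const double *)p));

    return _mm_add_ps(a, b);                       /* ADDPS consumes both loads */
}
```

Note that compilers routinely go the other way and emit MOVAPS/MOVUPS even for double and integer data, because the single-precision encodings are a byte shorter.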
I really have no idea what it means by "they only enforce it at the microarchitectural level" except that it suggests the different "data types" are treated differently by the μarch. I have a few guesses:
- AIUI, x86 cores typically use register renaming due to the shortage of registers. Perhaps they internally use different registers for integer/single/double operands so they can be located nearer to the respective vector units.
- It also seems possible that FP numbers are represented internally using a different format (e.g. using a bigger exponent to get rid of denorms) and converted to the canonical bits only when necessary.
- CPUs use "forwarding" or "bypassing" so that execution units don't have to wait for a result to be written back to the register file before a subsequent instruction can use it, typically saving a cycle or two. This might not happen between the integer and FP units (a rough way to test this guess is sketched below).
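If I wanted to test the bypass-delay guess, a dependency-chain microbenchmark along these lines might show it. This is only a rough sketch: it assumes GCC/Clang for `__rdtsc`, ignores turbo/frequency effects, and on some cores the XORPS/PXOR pair may show no measurable difference at all:

```c
#include <emmintrin.h>
#include <x86intrin.h>   /* __rdtsc(); GCC/Clang */
#include <stdio.h>

#define ITERS 100000000L

int main(void)
{
    const __m128  sf = _mm_set1_ps(-0.0f);
    const __m128i si = _mm_set1_epi32((int)0x80000000u);

    /* Loop-carried chain staying in the FP domain: ADDPS -> XORPS -> ... */
    __m128 x = _mm_set1_ps(1.0f);
    unsigned long long t0 = __rdtsc();
    for (long i = 0; i < ITERS; i++) {
        x = _mm_add_ps(x, x);
        x = _mm_xor_ps(x, sf);
    }
    unsigned long long t1 = __rdtsc();

    /* Same chain, but the XOR is PXOR: ADDPS -> PXOR -> ADDPS -> ...
     * If FP->integer->FP forwarding costs extra cycles, this loop is slower. */
    __m128 y = _mm_set1_ps(1.0f);
    unsigned long long t2 = __rdtsc();
    for (long i = 0; i < ITERS; i++) {
        y = _mm_add_ps(y, y);
        y = _mm_castsi128_ps(_mm_xor_si128(_mm_castps_si128(y), si));
    }
    unsigned long long t3 = __rdtsc();

    /* Store and print the results so the chains can't be optimized away.
     * (The values saturate to +/-inf after ~128 iterations, which is harmless.) */
    float out[8];
    _mm_storeu_ps(out, x);
    _mm_storeu_ps(out + 4, y);
    printf("xorps chain: %.2f cycles/iter (%g)\n", (double)(t1 - t0) / ITERS, out[0]);
    printf("pxor  chain: %.2f cycles/iter (%g)\n", (double)(t3 - t2) / ITERS, out[4]);
    return 0;
}
```

Each loop iteration is a serial dependency through the XMM register, so the measured cycles per iteration should approximate the summed latencies of the two instructions plus any bypass delay between domains.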
…movaps to load doubles, it works out anyway? It's a little weird to word it the way they did, especially since there is no conversion, but I don't see what else they could mean. – harold