Agner Fog says:
The instructions MOVDQA, MOVDQU, MOVAPS, MOVUPS, MOVAPD and MOVUPD
are all identical when used with [128 bit] register operands
Then he goes on to say (he's using the aligned versions in his examples, but I'm guessing the same applies for the unaligned variants):
On Intel Core 2 and earlier Intel processors, some floating point instructions are executed in
the integer units. This includes XMM move instructions, Boolean, and some shuffle and
pack instructions. These instructions have a bypass delay when mixed with instructions that
use the floating point unit. On most other processors, the execution unit used is in
accordance with the instruction name, e.g. MOVAPS XMM1,XMM2 uses the floating point unit,
MOVDQA XMM1,XMM2 uses the integer unit.
Instructions that read or write memory use a separate unit. The bypass delay from the
memory unit to the floating point unit may be longer than to the integer unit on some
processors, but it doesn't depend on the type of the instruction. Thus, there is no difference
in latency between MOVAPS XMM0,[MEM] and MOVDQA XMM0,[MEM] on current processors,
but it cannot be ruled out that there will be a difference on future processors.
[Y]ou may use MOVAPS instead of MOVAPD or MOVDQA for moving data to or from
memory or between registers. A bypass delay occurs in some processors when using
MOVAPS for moving the result of an integer instruction to another register, but not when
moving data to or from memory.
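
To make the "to or from memory" case concrete, here is a minimal C sketch using SSE2 intrinsics. It assumes an x86 target with SSE2. The helper names (copy16_ps, copy16_si128, copies_match, check_copies) are made up for illustration and do not come from Agner Fog's manual. The float-typed intrinsics are the ones usually associated with MOVUPS and the integer-typed ones with MOVDQU, but the compiler is free to pick either encoding, so check the generated assembly if the exact instruction matters. The sketch only shows that both paths move the same 16 bytes; it does not measure latency or bypass delays.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics (also pulls in the SSE ones) */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy 16 bytes through the float-typed
   unaligned load/store (the MOVUPS flavour). */
static void copy16_ps(void *dst, const void *src) {
    __m128 v = _mm_loadu_ps((const float *)src);
    _mm_storeu_ps((float *)dst, v);
}

/* Hypothetical helper: the same copy through the integer-typed
   unaligned load/store (the MOVDQU flavour). */
static void copy16_si128(void *dst, const void *src) {
    __m128i v = _mm_loadu_si128((const __m128i *)src);
    _mm_storeu_si128((__m128i *)dst, v);
}

/* Returns 1 when both copies reproduce the source bytes exactly. */
static int copies_match(const uint8_t src[16]) {
    uint8_t a[16], b[16];
    copy16_ps(a, src);
    copy16_si128(b, src);
    return memcmp(a, src, 16) == 0 && memcmp(b, src, 16) == 0;
}

/* Small self-check over an arbitrary byte pattern. */
static void check_copies(void) {
    uint8_t src[16];
    for (int i = 0; i < 16; i++) {
        src[i] = (uint8_t)(i * 17 + 3);
    }
    assert(copies_match(src));
}
```

Calling check_copies() from main() should pass silently, since the moves are plain bit copies regardless of the nominal data type. Whether one form costs more than the other is a question of execution domains on a given processor, which this sketch cannot show.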