I wrote a simple vector addition program in C using vector intrinsics. It loads two vectors, adds them, and finally stores the result vector back to global memory.
When I check the assembly code, it has the following sequence of instructions:
movdqa 0(%rbp,%rax), %xmm7
paddd (%r12,%rax), %xmm7
movdqa %xmm7, (%rbx,%rax)
As you can see, it only moves one operand of the paddd instruction into a register (xmm7). In the paddd instruction, the first operand refers to an address in global memory instead of being moved into a register first.
Does this mean that when paddd gets executed, it first does a mov from global memory to a register and then adds the two operands, which are both in registers? That would be equivalent to the following code sequence:
movdqa 0(%rbp,%rax), %xmm7
movdqa 0(%r12,%rax), %xmm8
paddd %xmm8, %xmm7
movdqa %xmm7, (%rbx,%rax)
Let me know if you need more information, such as a compilable program, so that you can generate the assembly yourself.
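For reference, a minimal sketch of the kind of program described above might look like the following. This is not the original source: the function name add_vectors, its parameters, and the requirement that n be a multiple of 4 are assumptions made for illustration. It uses the SSE2 intrinsics _mm_load_si128, _mm_add_epi32, and _mm_store_si128, which correspond to movdqa, paddd, and movdqa respectively.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_load_si128, ... */

/* Hypothetical helper: adds n 32-bit integers from x and y into out.
 * Assumes n is a multiple of 4 and that all three pointers are
 * 16-byte aligned, as required by the aligned load/store (movdqa). */
void add_vectors(const int *x, const int *y, int *out, int n)
{
    assert(n % 4 == 0);
    for (int i = 0; i < n; i += 4) {
        /* Two explicit loads are written in the source ... */
        __m128i vx = _mm_load_si128((const __m128i *)(x + i));
        __m128i vy = _mm_load_si128((const __m128i *)(y + i));
        /* ... followed by a packed 32-bit add (paddd) and an aligned store. */
        __m128i sum = _mm_add_epi32(vx, vy);
        _mm_store_si128((__m128i *)(out + i), sum);
    }
}
```

Compiling something like this with optimization enabled and asking for assembly output (for example gcc -O2 -S) will typically show the compiler folding one of the two source-level loads into the memory operand of paddd, even though both loads are written explicitly in C.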
paddd (%r12,%rax), %xmm7 is a two-uop instruction on most processors: one for the load, and one for the add. - Mysticial