1
votes

I wrote a simple vector addition program in C using vector intrinsics. It loads two vectors, adds them, and finally stores the result vector back to global memory.

When I check the generated assembly, I see the following sequence of instructions:

movdqa  0(%rbp,%rax), %xmm7    
paddd (%r12,%rax), %xmm7
movdqa  %xmm7, (%rbx,%rax)

As you can see, only one operand of the paddd instruction is moved to a register (xmm7). The first operand of paddd refers directly to an address in global memory instead of being moved to a register first.

Does this mean that when paddd is executed, it first does a mov from global memory to a register and then adds the two operands, both of which are now in registers? In other words, is it equivalent to the following code sequence?

movdqa  0(%rbp,%rax), %xmm7
movdqa  0(%r12,%rax), %xmm8    
paddd %xmm8, %xmm7
movdqa  %xmm7, (%rbx,%rax)

Let me know if you need more information, such as a compilable program, so that you can generate the assembly yourself.

It's the equivalent of the second example except that XMM8 isn't changed. Or would be if you used the same base register (R12 vs R11) in both examples. - Ross Ridge
Instructions are often broken up into one or more uops within the processor. paddd (%r12,%rax), %xmm7 is a two uop instruction on most processors. One for the load, and one for the add. - Mysticial

1 Answer

6
votes

Most x86 instructions can take a memory source operand, so the load doesn't have to be a separate instruction. A read-modify instruction like paddd (%r12,%rax), %xmm7 is just as fast as a separate load followed by the operation. The advantage is that it takes fewer instruction bytes and doesn't need an extra register.

On Intel CPUs it can also execute more efficiently in some cases, because the load and the add can be micro-fused into a single uop for part of the pipeline. So if you don't need the data at that memory address again soon, prefer folding loads into other instructions.
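
For reference, here is a minimal sketch of the kind of C source that could compile to a sequence like the one in the question. The question's actual program isn't shown, so the function name add_i32, its signature, and the alignment and length requirements are all assumptions made for illustration. There is a separate load intrinsic for each input, but the compiler is free to fold one of those loads into the paddd as a memory operand.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_add_epi32, ... */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the question): c[i] = a[i] + b[i].
 * Assumes a, b and c are 16-byte aligned and n is a multiple of 4. */
void add_i32(const int32_t *a, const int32_t *b, int32_t *c, size_t n)
{
    /* Aligned loads/stores fault on misaligned addresses, so check
     * the assumptions up front. */
    assert(((uintptr_t)a % 16) == 0);
    assert(((uintptr_t)b % 16) == 0);
    assert(((uintptr_t)c % 16) == 0);
    assert(n % 4 == 0);

    for (size_t i = 0; i < n; i += 4) {
        /* Two loads in the source; the compiler may emit one movdqa
         * and fold the other load into paddd as a memory operand. */
        __m128i va = _mm_load_si128((const __m128i *)(a + i));
        __m128i vb = _mm_load_si128((const __m128i *)(b + i));
        __m128i vc = _mm_add_epi32(va, vb);       /* paddd */
        _mm_store_si128((__m128i *)(c + i), vc);  /* aligned store */
    }
}
```

Compiling this with something like gcc -O2 -S and reading the loop body shows which form the compiler picked. Either form computes the same result.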

See http://agner.org/optimize/ for docs on CPU internals, and how to optimize your asm and C code.