I wrote this simple Java program:
    package com.salil.threads;

    public class IncrementClass {
        static volatile int j = 0;
        static int i = 0;

        public static void main(String[] args) {
            for (int a = 0; a < 1000000; a++);
            i++;
            j++;
        }
    }
This generates the following disassembled code for i++ and j++ (the rest of the disassembly is omitted):
0x0000000002961a6c: 49ba98e8d0d507000000 mov r10,7d5d0e898h
; {oop(a 'java/lang/Class' = 'com/salil/threads/IncrementClass')}
0x0000000002961a76: 41ff4274 inc dword ptr [r10+74h]
;*if_icmpge
; - com.salil.threads.IncrementClass::main@5 (line 10)
0x0000000002961a7a: 458b5a70 mov r11d,dword ptr [r10+70h]
0x0000000002961a7e: 41ffc3 inc r11d
0x0000000002961a81: 45895a70 mov dword ptr [r10+70h],r11d
0x0000000002961a85: f083042400 lock add dword ptr [rsp],0h
;*putstatic j
; - com.salil.threads.IncrementClass::main@27 (line 14)
This is what I understand about the assembly code above:
- mov r10,7d5d0e898h : Moves the pointer to IncrementClass.class into register r10
- inc dword ptr [r10+74h] : Increments the 4-byte value at the address [r10 + 74h] (i.e. i)
- mov r11d,dword ptr [r10+70h] : Moves the 4-byte value at the address [r10 + 70h] into register r11d (i.e. moves the value of j into r11d)
- inc r11d : Increments r11d
- mov dword ptr [r10+70h],r11d : Writes the value of r11d to [r10 + 70h] so it is visible to other threads
- lock add dword ptr [rsp],0h : Locks the memory address represented by the stack pointer rsp and adds 0 to it
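At the Java level, the load/inc/store sequence for j listed above corresponds to a separate volatile read followed by a separate volatile write. Here is a minimal sketch of that decomposition; the class name VolatileIncrementSteps and the method incrementInSteps are hypothetical, and the instruction mapping in the comments simply restates the disassembly from the question rather than anything guaranteed for other JVMs or CPUs.

```java
// Hypothetical class; mirrors the static volatile field j from the question.
public class VolatileIncrementSteps {
    static volatile int j = 0;

    // Performs j++ as the three separate steps seen in the disassembly.
    static int incrementInSteps() {
        int tmp = j;   // volatile read:  mov r11d, dword ptr [r10+70h]
        tmp = tmp + 1; // register add:   inc r11d
        j = tmp;       // volatile write: mov dword ptr [r10+70h], r11d (followed by the locked add)
        return tmp;
    }

    public static void main(String[] args) {
        System.out.println(incrementInSteps());
    }
}
```

Because the read and the write are distinct steps, another thread can update j between them, which is why j++ on a volatile field is not an atomic increment.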
The JMM states that there must be a load memory barrier before each volatile read and a store barrier after every volatile write. My questions are:
- Why isn't there a load barrier before the read of j into r11d?
- How does the lock add on rsp ensure that the value of j in r11d is propagated back to main memory? All I can find in the Intel specs is that lock gives the CPU an exclusive lock on the specified memory address for the duration of the operation.
lock inc dword [r10+70h] would do everything that load/inc/store/full-barrier does, and more (i.e. actually be atomic). It would be at least as fast, and many fewer code bytes. lock add [rsp], 0 is a full barrier because every locked instruction is. There's debate about whether MFENCE or an otherwise no-op locked insn to stack memory (which should be in the E state in L1 already) is better. MFENCE has worse throughput, but fewer uops so maybe less impact on surrounding instructions when a chain of MFENCE isn't all you're doing. - Peter Cordes
mov r10, imm64 is also suspicious. That's inside the loop??? Is this optimized code from a JIT? Is inc r11d the loop counter, or is that at least kept in a register? - Peter Cordes
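The first comment's point that the load/inc/store sequence is not atomic can be observed from Java. The sketch below compares a volatile int counter with java.util.concurrent.atomic.AtomicInteger, whose incrementAndGet is a single atomic read-modify-write (on x86, HotSpot typically compiles it to a locked instruction, though that is not guaranteed). The class name AtomicVsVolatile, the method runIncrements, and the thread and iteration counts are all illustrative assumptions.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class; not part of the original program.
public class AtomicVsVolatile {
    static volatile int volatileCounter = 0;
    static final AtomicInteger atomicCounter = new AtomicInteger();

    // Starts the given number of threads; each increments both counters perThread times.
    static void runIncrements(int threads, int perThread) throws InterruptedException {
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int n = 0; n < perThread; n++) {
                    volatileCounter++;               // separate read and write: updates can be lost
                    atomicCounter.incrementAndGet(); // one atomic read-modify-write
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join(); // wait for every worker to finish
        }
    }

    public static void main(String[] args) throws InterruptedException {
        runIncrements(4, 100_000);
        System.out.println("atomic   = " + atomicCounter.get());
        System.out.println("volatile = " + volatileCounter);
    }
}
```

The atomic counter always ends at threads * perThread. The volatile counter can end lower, and by how much varies from run to run, because two threads may read the same value before either writes it back.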