11
votes

Consider the following small function:

void foo(int* iptr) {
    iptr[10] = 1;
    __asm__ volatile ("nop"::"r"(iptr):);
    iptr[10] = 2;
}

Using gcc, this compiles to:

foo:
        nop
        mov     DWORD PTR [rdi+40], 2
        ret

Note in particular, that the first write to iptr, iptr[10] = 1 doesn't occur at all: the inline asm nop is the first thing in the function, and only the final write of 2 appears (after the ASM call). Apparently the compiler decides that it only needs to provide an up-to-date version of the value of iptr itself, but not the memory it points to.

I can tell the compiler that memory must be up to date with a memory clobber, like so:

void foo(int* iptr) {
    iptr[10] = 1;
    __asm__ volatile ("nop"::"r"(iptr):"memory");
    iptr[10] = 2;
}

which results in the expected code:

foo:
        mov     DWORD PTR [rdi+40], 1
        nop
        mov     DWORD PTR [rdi+40], 2
        ret

However, this is too strong of a condition, since it tells the compiler all memory has to be written. For example, in the following function:

void foo2(int* iptr, long* lptr) {
    iptr[10] = 1;
    lptr[20] = 100;
    __asm__ volatile ("nop"::"r"(iptr):);
    iptr[10] = 2;
    lptr[20] = 200;
}

The desired behavior is to let the compiler optimize away the first write to lptr[20], but not the first write to iptr[10]. The "memory" clobber cannot achieve this because it means both writes have to occur:

foo2:
        mov     DWORD PTR [rdi+40], 1
        mov     QWORD PTR [rsi+160], 100 ; lptr[10] written unecessarily
        nop
        mov     DWORD PTR [rdi+40], 2
        mov     QWORD PTR [rsi+160], 200
        ret

Is there some way to tell compilers accepting gcc extended asm syntax that the input to the asm includes the pointer and anything it can point to?

1
@HadiBrais - yes, but this is just a side of effect of the = meaning that the asm might change the pointer value, so gcc has to make both writes (since they might write to different locations). However, it doesn't mean gcc has to do the write before calling the inline asm (although it happens to in this case) and so it doesn't work in general (you can construct a similar example where it fails).BeeOnRope
Also, don't you want "+r"? Using "=r" doesn't even require the compiler to pass the pointer value into the asm, I think?BeeOnRope
You want the "Specifically, a "m" (*(const float (*)[]) fptr) will tell the compiler that the entire array object is an input, arbitrary-length." part from the answer to the linked duplicate.Jester
That is true. But if the answer helps, then I guess marking it as duplicate is fine. Future visitors who might find your question will be pointed in the right direction. Or do you propose some other solution? We could ask Peter to post an answer here too so he gets the credits.Jester
@Jester: I think we could use a stand-alone canonical Q&A about this topic. I'd been planning to write one up at some point with a simple example that shows unwanted dead-store elimination. Putting it into my loop over arrays answer was just a stopgap. We want a question with a title that summarizes the problem, a question body that demonstrates the problem, and an answer that shows the currently best-practice dummy memory input or output operand with the cast-to-array syntax. That's good for linking to show people that the problem exists in the first place.Peter Cordes

1 Answers

14
votes

That's correct; asking for a pointer as input to inline asm does not imply that the pointed-to memory is also an input or output or both. With a register input and register output, for all gcc knows your asm just aligns a pointer by masking off the low bits, or adds a constant to it. (In which case you would want it to optimize away a dead store.)

The simple option is asm volatile and a "memory" clobber1.

The narrower more specific way you're asking for is to use a "dummy" memory operand as well as the pointer in a register. Your asm template doesn't reference this operand (except maybe inside an asm comment to see what the compiler picked). It tells the compiler which memory you actually read, write, or read+write.

Dummy memory input: "m" (*(const int (*)[]) iptr)
or output: "=m" (*(int (*)[]) iptr). Or of course "+m" with the same syntax.

That syntax is casting to a pointer-to-array and dereferencing, so the actual input is a C array. (If you actually have an array, not pointer, you don't need any casting and can just ask for it as a memory operand.)

If you leave the size unspecified with [], that tells GCC that any memory accessed relative to that pointer is an input, output, or in/out operand. If you use [10] or [some_variable], that tells the compiler the specific size. With runtime-variable sizes, gcc in practice misses the optimization that iptr[size+1] is not part of the input.

GCC documents this and therefore supports it. I think it's not a strict-aliasing violation if the array element type is the same as the pointer, or maybe if it's char.

(from the GCC manual)
An x86 example where the string memory argument is of unknown length.

   asm("repne scasb"
    : "=c" (count), "+D" (p)
    : "m" (*(const char (*)[]) p), "0" (-1), "a" (0));

If you can avoid using an early-clobber on the pointer input operand, the dummy memory input operand will typically pick a simple addressing mode using that same register.

But if you do use an early-clobber for strict correctness of an asm loop, sometimes a dummy operand will make gcc waste instructions (and an extra register) on a base address for the memory operand. Check the asm output of the compiler.


Background:

This is a widespread bug in inline-asm examples which often goes undetected because the asm is wrapped in a function that doesn't inline into any callers that tempt the compiler into reordering stores for merging doing dead-store elimination.

GNU C inline asm syntax is designed around describing a single instruction to the compiler. The intent is that you tell the compiler about a memory input or memory output with a "m" or "=m" operand constraint, and it picks the addressing mode.

Writing whole loops in inline asm requires care to make sure the compiler really knows what's going on (or asm volatile plus a "memory" clobber), otherwise you risk breakage when changing the surrounding code, or enabling link-time optimization that allows for cross-file inlining.

See also Looping over arrays with inline assembly for using an asm statement as the loop body, still doing the loop logic in C. With actual (non-dummy) "m" and "=m" operands, the compiler can unroll the loop by using displacements in the addressing modes it chooses.


Footnote 1: A "memory" clobber gets the compiler to treat the asm like a non-inline function call (that could read or write any memory except for locals that escape analysis has proved have not escaped). The escape analysis includes input operands to the asm statement itself, but also any global or static variables that any earlier call could have stored pointers into. So usually local loop counters don't have to be spilled/reloaded around an asm statement with a "memory" clobber.

asm volatile is necessary to make sure the asm isn't optimized away even if its output operands are unused (because you require the un-declared the side-effect of writing memory to happen).

Or for memory that is only read by asm, you you need the asm to run again if the same input buffer contains different input data. Without volatile, the asm statement could be CSEd out of a loop. (A "memory" clobber does not make the optimizer treat all memory as an input when considering whether the asm statement even needs to run.)

asm with no output operands is implicitly volatile, but it's a good idea to make it explicit. (The GCC manual has a section on asm volatile).

e.g. asm("... sum an array ..." : "=r"(sum) : "r"(pointer), "r"(end_pointer) : "memory") has an output operand so is not implicitly volatile. If you used it like

 arr[5] = 1;
 total += asm_sum(arr, len);
 memcpy(arr, foo, len);
 total += asm_sum(arr, len);

Without volatile the 2nd asm_sum could optimize away, assuming that the same asm with the same input operands (pointer and length) will produce the same output. You need volatile for any asm that's not a pure function of its explicit input operands. If it doesn't optimize away, then the "memory" clobber will have the desired effect of requiring memory to be in sync.