6
votes

I was engaged with an expert who allegedly has vastly superior coding skills than myself who understands inline assembly far better than I ever could.

One of the claims is that as long as an operand appears as an input constraint, you don't need to list it as a clobber or specify that the register has been potentially modified by the inline assembly. The conversation came about when someone else was trying to get assistance on a memset implementation that was effectively coded this way:

void *memset(void *dest, int value, size_t count)
{
    asm volatile  ("cld; rep stosb" :: "D"(dest), "c"(count), "a"(value));
    return dest;
}

The expert's claim when I commented about the issue with clobbering registers without telling the compiler, was to tell us that:

"c"(count) already tells the compiler c is clobbered

I found an example in the expert's own operating system where they write similar code with the same design pattern. They use Intel syntax for their inline assembly. This hobby operating system code operates in a kernel (ring0) context. An example is this buffer swap function1:

void swap_vbufs(void) {
    asm volatile (
        "1: "
        "lodsd;"
        "cmp eax, dword ptr ds:[rbx];"
        "jne 2f;"
        "add rdi, 4;"
        "jmp 3f;"
        "2: "
        "stosd;"
        "3: "
        "add rbx, 4;"
        "dec rcx;"
        "jnz 1b;"
        :
        : "S" (antibuffer0),
          "D" (framebuffer),
          "b" (antibuffer1),
          "c" ((vbe_pitch / sizeof(uint32_t)) * vbe_height)
        : "rax"
    );

    return;
}

antibuffer0, antibuffer1, and framebuffer are all buffers in memory treated as arrays of uint32_t. framebuffer is actual video memory (MMIO) and antibuffer0, antibuffer1 are buffers allocated in memory.

The global variables are properly set up before this function is called. They are declared as:

volatile uint32_t *framebuffer;
volatile uint32_t *antibuffer0;
volatile uint32_t *antibuffer1;

int vbe_width = 1024;
int vbe_height = 768;
int vbe_pitch;

My Questions and Concerns about this Kind of Code

As an apparent neophyte to inline assembly having an apparent naive understanding of the subject, I'm wondering whether my apparent uneducated belief this code is potentially very buggy is correct. I want to know if these concerns have any merit:

  1. RDI, RSI, RBX, and RCX are all modified by this code. RDI and RSI are incremented by LODSD and STOSD implicitly. The rest are modified explicitly with

        "add rbx, 4;"
        "dec rcx;"
    

    None of these registers are listed as input/output nor are they listed as output operands. I believe these constraints need to be modified to inform the compiler that these registers may have been modified/clobbered. The only register that is listed as clobbered which I believe is correct is RAX. Is my understanding correct? My feeling is that RDI, RSI, RBX, and RCX should be input/output constraints (Using the + modifier). Even if one tries to argue that the 64-bit System V ABI calling convention will save them (assumptions that a poor way IMHO to write such code) RBX is a non-volatile register that will change in this code.

  2. Since the addresses are passed via registers (and not memory constraints), I believe it is a potential bug that the compiler hasn't been told that memory that these pointers are pointing at has been read and/or modified. Is my understanding correct?

  3. RBX, and RCX are hard coded registers. Wouldn't it make sense to allow the compiler to choose these registers automatically via the constraints?

  4. If one assumes that inline assembly has to be used here (hypothetically) what would bug free GCC inline assembly code look like for this function? Is this function fine as is, and I just don't understand the basics of GCC's extended inline assembly like the expert does?


Footnotes

  • 1The swap_vbufs function and associated variable declarations have been reproduced verbatim without the copyright holder's permission under fair use for purposes of commentary about a larger body of work.
1
From the docs: Do not modify the contents of input-only operands (except for inputs tied to outputs). The compiler assumes that on exit from the asm statement these operands contain the same values as they had before executing the statement. It is not possible to use clobbers to inform the compiler that the values in these inputs are changing. Pretty unambiguous.David Wohlferd
I may have understated my understanding of writing code or my understanding of inline assembly when posting my question. I felt it prudent in the face of an author with alleged superior knowledge and understanding - to believe that I must be a neophyte when looking at this kind of thing.Michael Petch
@MichaelPetch: yeah very much gcc.gnu.org/wiki/DontUseInlineAsm here; swap_vbufs doesn't look optimal. I think it's just a copy-if-different. I guess if diffs are rare, the branch misses here might be worth avoiding MMIO stores, but a rep movsd or movnti unconditional streaming store to video RAM might be fine. Either way, it's not the most efficient implementation of the loop; using stosd makes it worse because it requires an else add rdi,4 branch with an extra jmp on the (hopefully) fast path.Peter Cordes
"work by luck alone" - In this case, it's clear that's what's happening. But the problem is that ANY asm might be working only due to luck. Unless/until it fails, how would you know that you missed an early clobber or some such? Examining the output just isn't sufficient to ensure correctness. And yeah, there's a lot of badly written asm out there which people who don't understand it just copy. Even if the original author cleans it up, everyone else is still screwed. It sounds like you've got a good understand of the how and when though.David Wohlferd
As for DontUseInlineAsm, I don't know that there's any great wisdom there, but there might be things that people who don't use asm often haven't fully considered. I fell in love with asm when I first discovered it. But over time, I came to understand that despite its power, it's almost always a bad idea. It gives interesting insight into how the compiler works, but I'd think long and hard before using it in production code. That said, I mentally add 1 to my karma every time it gets referenced...David Wohlferd

1 Answers

7
votes

You are correct on all counts, this code is full of lies to the compiler that could bite you. e.g. with different surrounding code, or different compiler versions / options (especially link-time optimization to enable cross-file inlining).

swap_vbufs doesn't even look very efficient, I suspect gcc would do equal or better with a pure C version. https://gcc.gnu.org/wiki/DontUseInlineAsm. stosd is 3 uops on Intel, worse than a regular mov-store + add rdi,4. And making add rdi,4 unconditional would avoid the need for that else block which puts an extra jmp on the (hopefully) fast path where there's no MMIO store to video RAM because the buffers were equal.

(lodsd is only 2 uops on Haswell and newer so that's ok if you don't care about IvyBridge or older).

In kernel code I guess they're avoiding SSE2, even though it's baseline for x86-64, otherwise you'd probably want to use that. For a normal memory destination, you'd just memcpy with rep movsd or ERMSB rep movsb, but I guess the point here is to avoid MMIO stores when possible by checking against a cached copy of video RAM. Still, unconditional streaming stores with movnti might be efficient, unless video RAM is mapped UC (uncacheable) instead of WC.


It's easy to construct examples where this really does break in practice, by e.g. using the relevant C variable again after the inline asm statement in the same function. (Or in a parent function which inlined the asm).

An input you want to destroy has to be handled usually with a matching dummy output or a RMW output with a C tmp var, not just "r". or "a".

"r" or any specific-register constraint like "D" means this is a read-only input, and the compiler can expect to find the value undisturbed afterwards. There is no "input I want to destroy" constraint; you have to synthesize that with a dummy output or variable.

This all applies to other compilers (clang and ICC) that support GNU C inline asm syntax.

From the GCC manual: Extended asm Input Operands:

Do not modify the contents of input-only operands (except for inputs tied to outputs). The compiler assumes that on exit from the asm statement these operands contain the same values as they had before executing the statement. It is not possible to use clobbers to inform the compiler that the values in these inputs are changing.

(An rax clobber makes it an error to use "a" as an input; clobbers and operands can't overlap.)


Example for 1: register input operands

int plain_C(int in) {   return (in+1) + in;  }

// buggy: modifies an input read-only operand
int bad_asm(int in) {
    int out;
    asm ("inc %%edi;\n\t mov %%edi, %0" : "=a"(out) : [in]"D"(in) );
    return out + in;
}

Compiled on the Godbolt compiler explorer

Notice that gcc's addl uses edi for in, even though inline asm used that register as an input. (And thus breaks because this buggy inline asm modifies the register). It happens to hold in+1 in this case. I used gcc9.1, but this is not new behaviour.

## gcc9.1 -O3 -fverbose-asm
bad(int):
        inc %edi;
         mov %edi, %eax         # out  (comment mentions out because I used %0)

        addl    %edi, %eax      # in, tmp86
        ret     

We fix that by telling the compiler that the same input register is also an output, so it can no longer count on that . (Or by using auto tmp = in; asm("..." : "+r"(tmp));)

int safe(int in) {
    int out;
    int dummy;
    asm ("inc %%edi;\n\t mov %%edi, %%eax"
     : "=a"(out),
       "=&D"(dummy)
     : [in]"1"(in)  // matching constraint, or "D" works.
    );
    return out + in;
}
# gcc9.1 again.
safe_asm(int):
        movl    %edi, %edx      # tmp89, in    compiler-generated save of in
          # start inline asm
        inc %edi;
         mov %edi, %eax
          # end inline asm
        addl    %edx, %eax      # in, tmp88
        ret

Obviously "lea 1(%%rdi), %0" would avoid the problems by not modifying the input in the first place, and so would mov/inc. This is an artificial example that intentionally destroys an input.


If the function does not inline and doesn't use the input variable after the asm statement, you typically get away with lying to the compiler, as long as it's a call-clobbered register.

It's not rare to find people that have written unsafe code that happens to work in the context they're using it in. It's also not rare for them to be convinced that simply testing it in that context with one compiler version/options is sufficient to verify its safety or correctness.

But that's not how asm works; the compiler trusts you to accurately describe the asm's behaviour, and simply does text substitution on the template part.

It would be a crappy missed optimization if gcc assumed that asm statements always destroyed their inputs. In fact, the same constraints that inline asm uses are (I think) used in the internal machine-description files that teach gcc about an ISA. (So destroyed inputs would be terrible for code-gen).

The whole design of GNU C inline asm is based around wrapping a single instruction, that's why even early-clobber for outputs isn't the default. You have to do that manually if necessary, if writing multiple instructions or a loop inside inline asm.


a potential bug that the compiler hasn't been told that memory that these pointers are pointing at has been read and or modified.

That's also correct. A register input operand does not imply that the pointed-to memory is also an input operand. In a function that can't inline, this can't actually cause problems, but as soon as you enable link-time optimization, cross-file inlining and inter-procedural optimization becomes possible.

There's an existing Informing clang that inline assembly reads a particular region of memory unanswered question. This Godbolt link shows some of the ways you can reveal this problem, e.g.

   arr[2] = 1;
   asm(...);
   arr[2] = 0;

If gcc assumes arr[2] isn't an input to the asm, only the arr address itself, it will do dead-store elimination and remove the =1 assignment. (Or look at it as reordering the store with the asm statement, then collapsing 2 stores to the same location).

An array is good because it shows that even "m"(*arr) doesn't work for a pointer, only an actual array. That input operand would only tell the compiler that arr[0] is an input, still not arr[2]. That's a good thing if that's all your asm reads, because it doesn't block optimization of other parts.

For that memset example, to properly declare that the pointed-to memory is an output operand, cast the pointer to a pointer-to-array and dereference it, to tell gcc that an entire range of memory is the operand. *(char (*)[count])pointer. (You can leave the [] empty to specify an arbitrary-length region of memory accessed via this pointer.)

// correct version written by @MichaelPetch.  
void *memset(void *dest, int value, size_t count)
{
  void *tmp = dest;
  asm ("rep stosb    # mem output is %2"
     : "+D"(tmp), "+c"(count),       // tell the compiler we modify the regs
       "=m"(*(char (*)[count])tmp)   // dummy memory output
     : "a"(value)                    // EAX actually is read-only
     : // no clobbers
  );
  return dest;
}

Including an asm comment using the dummy operand lets us see how the compiler allocated it. We can see the compiler picks (%rdi) with AT&T syntax, so it is willing to use a register that's also an input/output operand.

With an early-clobber on the output it might have wanted to use another register, but without that it doesn't cost us anything to gain correctness.

With a void function that doesn't return the pointer (or after inlining into a function that doesn't use the return value), it doesn't have to copy the pointer arg anywhere before letting rep stosb destroy it.