5
votes

Background

I am aware that solving the following problem with inline assembly is a bad idea. I'm currently learning inline assembly as part of a class on the linux kernel, and this was part of an assignment for that class.

The Setup

The begin with, below is a snippet of code that is almost correct, but instead segfaults. It is a function that copies the substring of src starting at index s_idx and ending (exclusively) at index e_idx into the pre-allocated dest using only inline assembly.

static inline char *asm_sub_str(char *dest, char *src, int s_idx, int e_idx) {
  asm("addq %q2, %%rsi;"  /* Add start index to src (ptrs are 64-bit) */
      "subl %k2, %%ecx;"  /* Get length of substr as e - s (int is 32-bit) */
      "cld;"              /* Clear direction bit (force increment) */
      "rep movsb;"        /* Move %ecx bytes of str at %esi into str at %edi */
      : /* No Ouputs */
      : "S" (src), "D" (dest), "g" (s_idx), "c" (e_idx)
      : "cc", "memory"
      );

  return dest;
}

The issue with this code is the constraint for the second input parameter. When compiled with gccs default optimization and -ggdb, the following assembly is generated:

Dump of assembler code for function asm_sub_str:
   0x00000000004008e6 <+0>:     push   %rbp
   0x00000000004008e7 <+1>:     mov    %rsp,%rbp
   0x00000000004008ea <+4>:     mov    %rdi,-0x8(%rbp)
   0x00000000004008ee <+8>:     mov    %rsi,-0x10(%rbp)
   0x00000000004008f2 <+12>:    mov    %edx,-0x14(%rbp)
   0x00000000004008f5 <+15>:    mov    %ecx,-0x18(%rbp)
   0x00000000004008f8 <+18>:    mov    -0x10(%rbp),%rax
   0x00000000004008fc <+22>:    mov    -0x8(%rbp),%rdx
   0x0000000000400900 <+26>:    mov    -0x18(%rbp),%ecx
   0x0000000000400903 <+29>:    mov    %rax,%rsi
   0x0000000000400906 <+32>:    mov    %rdx,%rdi
   0x0000000000400909 <+35>:    add    -0x14(%rbp),%rsi
   0x000000000040090d <+39>:    sub    -0x14(%rbp),%ecx
   0x0000000000400910 <+42>:    cld    
   0x0000000000400911 <+43>:    rep movsb %ds:(%rsi),%es:(%rdi)
   0x0000000000400913 <+45>:    mov    -0x8(%rbp),%rax
   0x0000000000400917 <+49>:    pop    %rbp
   0x0000000000400918 <+50>:    retq 

This is identical to the assembly that is generated when the second input parameter's constraint is set to "m" instead of "g", leading me to believe the compiler is effectively choosing the "m" constraint. In stepping through these instructions with gdb, I found that the offending instruction is +35 which adds the starting offset index s_idx to the src pointer in %rsi. The problem of course is that s_idx is only 32-bits and the upper 4 bytes of a 64-bit integer at that location on the static is not necessarily 0. On my machine, it is in fact nonzero and causes the addition to muddle the upper 4 bytes of %rsi which leads to a segfault in instruction +43.

The Question

Of course the solution to the above is to change the constraint of parameter 2 to "r" so it's placed in its own 64-bit register where the top 4 bytes are correctly zeroed and call it a day. Instead, my question is why does gcc resolve the "g" constraint as "m" instead of "r" in this case when the expression "%q2" indicates the value of parameter 2 will be used as a 64-bit value?

I don't know much about how gcc parses inline assembly, and I know there's not really a sense of typing in assembly, but I would think that gcc could recognize the effectively implicit cast of s_idx to a long when it's used as a 64-bit value in the first inline instruction. FWIW, if I explicitly change "g" (s_idx) to "g" ((long) s_idx), gcc then resolves the "g" constraint to "r" since (long) s_idx is a temporary value. I would think gcc could do that implicitly as well?

1
Using size overrides in the code has no effect on parameter assignment and is a hack that should be avoided if at all possible. What do you think the poor compiler should do if you used it multiple times with different sizes? PS: that code is horribly broken because it changes all of the inputs (except s_idx) without telling the compiler.Jester
Well, assuming I had used the "r" constraints, how would I even get around using the size overrides? Using just %2 in the first instruction won't compile because it needs a 64-bit register. As for the what the compiler should use when multiple sizes occur, I would think the largest size, just like assigning an int to a long causes an implicit cast.define cindy const
Also, what would you suggest I do about fixing the horribly broken part? Adding register %rsi, %rdi, or %rcx to the clobber list causes compile errors because they are already inputs and should be assumed implicitly clobbered (if I understand the compile error correctly). And adding the early clobber constraint & to the input constraints also won't compile.define cindy const
Some gcc versions do accept clobbers overlapping input operands, but the safe thing is to declare them as read-write output operands instead.Jester
You could do asm volatile( "cld;" "rep movsb;" : "=S"(src), "+D"(dest), "=c" (e_idx) : "0" (src+s_idx), "2" (e_idx-s_idx) : "cc", "memory" ); . Rather than compute the src and length in the inline template you can just get C to do it. It is possible to modify this in such a way to not need the "memory" constraint: asm ( "cld;" "rep movsb;" : "=S"(src), "+D"(dest), "=c" (e_idx), "=m"(*(char (*)[]) dest) : "0" (src+s_idx), "2" (e_idx-s_idx), "m"(*(const char (*)[]) src) : "cc" );Michael Petch

1 Answers

4
votes

but I would think that gcc could recognize the effectively implicit cast of s_idx to a long when it's used as a 64-bit value in the first inline instruction.

No, gcc only looks at the constraints, not the asm template string at all, when compiling the surrounding code. The part of gcc that fills in the % template operands is totally separate from register-allocation and code-gen for the surrounding code.

Nothing checks for sanity or understands the context that template operands are being used in. Maybe you have a 16-bit input and want to copy it to a vector register with vmovd %k[input], %%xmm0 / vpbroadcastw %%xmm0, %%ymm0. The upper 16 bits are ignored, so you don't want gcc to waste time zero or sign-extending it for you. But you definitely want to use vmovd instead of vpinsrw $0, %[input], %%xmm0, because that would be more uops and have a false dependency. For all gcc knows or cares, you could have used the operand in an asm comment line like "# low word of input = %h2 \n.

GNU C inline asm is designed so that the constraints tell the compiler everything it needs to know. Thus, you need to manually cast s_idx to long.

You don't need to cast the input for ECX, because the sub instruction will zero-extend the result implicitly (into RCX). Your inputs are signed types, but presumably you are expecting the difference to always be positive.

Register inputs must always be assumed to have high garbage beyond the width of the input type. This is similar to how function args in the x86-64 System V calling convention can have can have garbage in the upper 32 bits, but (I assume) with no unwritten rule about extending out to 32 bits. (And note that after function inlining, your asm statement's inputs might not be function args. You don't want to use __attribute__((noinline)), and as I said it wouldn't help anyway.)


leading me to believe the compiler is effectively choosing the "m" constraint.

Yes, gcc -O0 spills everything to memory between every C statement (so you can change it with a debugger if stopped at a breakpoint). Thus, a memory operand is the most efficient choice for the compiler. It would need a load instruction to get it back into a register. i.e. the value is in memory before the asm statement, at -O0.

(clang is bad at multiple-option constraints and picks memory even at -O3, even when that means spilling first, but gcc doesn't have that problem.)

gcc -O0 (and clang) will use an immediate for a g constraint when the input is a numeric literal constant, e.g. "g" (1234). In your case, you get:

    ...
    addq $1234, %rsi; 
    subl $1234, %ecx; 
    rep movsb
    ...

An input like "g" ((long)s_idx) will use a register even at -O0, just like x+y or any other temporary result (as long as s_idx isn't already long). Interestingly, even (unsigned) resulted in a register operand, even though int and unsigned are the same size and the cast takes no instructions. At this point you're seeing exactly how little gcc -O0 optimizes, because what you get is more dependent on how gcc internals are designed than on what makes sense or is efficient.


Compile with optimization enabled if you want to see interesting asm. See How to remove "noise" from GCC/clang assembly output?, especially the link to Matt Godbolt's CppCon2017 talk about looking at compiler output.

Although checking the asm without optimizations disabled is good, too for inline asm; you might not have realized the problem with using a q override if it was just registers, although it is still a problem. Checking how it inlines into a few different callers at -O3 can be useful, too (especially if you test with some compile-time-constant inputs).


Your code is seriously broken

Besides the high-garbage problems discussed above, you modify input-operand registers without telling the compiler about it.

Fixing this by making some of them "+" read/write outputs means your asm statement is no longer volatile by default, so the compiler will optimize it away if the outputs are unused. (This includes after function inlining, so the return dest is sufficient for the standalone version, but not after inlining if the caller ignores the return value.)

You did use a "memory" clobber, so the compiler will assume that you read/write memory. You could tell it which memory you read and write, so it can optimize around your copy more efficiently. See get string length in inline GNU Assembler: you can use dummy memory input/output constraints like "m" (*(const char (*)[]) src)

char *asm_sub_str_fancyconstraints(char *dest, char *src, int s_idx, int e_idx) {
  asm (
      "addq %[s_idx], %%rsi; \n\t"  /* Add start index to src (ptrs are 64-bit) */
      "subl %k[s_idx], %%ecx;          \n\t"  /* Get length of substr as e - s (int is 32-bit) */

      // the calling convention requires DF=0, and inline-asm can safely assume it, too
      // (it's widely done, including in the Linux kernel)
      //"cld;"              /* Clear direction bit (force increment) */

      "rep movsb;                \n\t"        /* Move %ecx bytes of str at %esi into str at %edi */
      : [src]"+&S" (src), [dest]"+D" (dest), [e_idx]"+c" (e_idx)
        , "=m" (*(char (*)[]) dest)     // dummy output: all of dest
      : [s_idx]"g" ((long long)s_idx)
        , "m" (*(const char (*)[]) src) // dummy input: tell the compiler we read all of src[0..infinity]
      : "cc"
      );

  return 0; // asm statement not optimized away, even without volatile,
            //  because of the memory output.
            // Just like dest++; could optimize away, but *dest = 0; couldn't.
}

formatting: note the use of \n\t at the end of each line for readability; otherwise the asm instructions are all on one line separated only by ;. (It will assemble fine, but not very human-readable if you're checking how your asm template worked out.)

This compiles (with gcc -O3) to

asm_sub_str_fancyconstraints:
    movslq  %edx, %rdx        # from the (long long)s_idx
    xorl    %eax, %eax        # from the return 0, which I changed to test that it doesn't optimize away
    addq %rdx, %rsi; 
    subl %edx, %ecx;          # your code zero-extends (e_idx - s_idx)
    rep movsb;                

    ret

I put this + a couple other versions on the Godbolt compiler explorer with gcc + clang. A simpler version fixes the bugs but still uses a "memory" clobber + asm volatile to get correctness with more compile-time optimization cost than this version that tells the compiler which memory is read and written.


Early clobber: Note the "+&S" constraint:

If for some weird reason, the compiler knew that the src address and s_idx were equal, it could use the same register (esi/rsi) for both inputs. This would lead to modifying s_idx before it was used in the sub. Declaring that the register holding src is clobbered early (before all input registers are read for the last time) will force the compiler to choose different registers.

See the Godbolt link above for a caller that causes breakage without the & for early-clobber. (But only with the nonsensical src = (char*)s_idx;). Early-clobber declarations are often necessary for multi-instruction asm statements to prevent more realistic breakage possibilities, so definitely keep this in mind, and only leave it out when you're sure it's ok for any read-only input to share a register with an output or input/output operand. (Of course using specific-register constraints limits that possibility.)

I omitted the early-clobber declaration from e_idx in ecx, because the only "free" parameter is s_idx, and putting them both in the same register will result in sub same,same, and rep movsb running 0 iterations as desired.


It would of course be more efficient to let the compiler do the math, and simply ask for the inputs to rep movsb in the right registers. Especially if both e_idx and s_idx are compile-time constants, it's silly to force the compiler to mov an immediate to a register and then subtract another immediate.

Or even better, don't use inline asm at all. (But if you really want rep movsb to test its performance, inline asm is one way to do it. gcc also has tuning options that control how memcpy inlines, if at all.)

No inline asm answer is complete without recommending that you https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it.