but I would think that gcc could recognize the effectively implicit cast of s_idx
to a long
when it's used as a 64-bit value in the first inline instruction.
No, gcc only looks at the constraints, not the asm
template string at all, when compiling the surrounding code. The part of gcc that fills in the %
template operands is totally separate from register-allocation and code-gen for the surrounding code.
Nothing checks for sanity or understands the context that template operands are being used in. Maybe you have a 16-bit input and want to copy it to a vector register with vmovd %k[input], %%xmm0
/ vpbroadcastw %%xmm0, %%ymm0
. The upper 16 bits are ignored, so you don't want gcc to waste time zero or sign-extending it for you. But you definitely want to use vmovd
instead of vpinsrw $0, %[input], %%xmm0
, because that would be more uops and have a false dependency. For all gcc knows or cares, you could have used the operand in an asm comment line like "# low word of input = %h2 \n
.
GNU C inline asm is designed so that the constraints tell the compiler everything it needs to know. Thus, you need to manually cast s_idx
to long
.
You don't need to cast the input for ECX, because the sub
instruction will zero-extend the result implicitly (into RCX). Your inputs are signed types, but presumably you are expecting the difference to always be positive.
Register inputs must always be assumed to have high garbage beyond the width of the input type. This is similar to how function args in the x86-64 System V calling convention can have can have garbage in the upper 32 bits, but (I assume) with no unwritten rule about extending out to 32 bits. (And note that after function inlining, your asm statement's inputs might not be function args. You don't want to use __attribute__((noinline))
, and as I said it wouldn't help anyway.)
leading me to believe the compiler is effectively choosing the "m" constraint.
Yes, gcc -O0
spills everything to memory between every C statement (so you can change it with a debugger if stopped at a breakpoint). Thus, a memory operand is the most efficient choice for the compiler. It would need a load instruction to get it back into a register. i.e. the value is in memory before the asm
statement, at -O0
.
(clang is bad at multiple-option constraints and picks memory even at -O3
, even when that means spilling first, but gcc doesn't have that problem.)
gcc -O0
(and clang
) will use an immediate for a g
constraint when the input is a numeric literal constant, e.g. "g" (1234)
. In your case, you get:
...
addq $1234, %rsi;
subl $1234, %ecx;
rep movsb
...
An input like "g" ((long)s_idx)
will use a register even at -O0
, just like x+y
or any other temporary result (as long as s_idx
isn't already long
). Interestingly, even (unsigned)
resulted in a register operand, even though int
and unsigned
are the same size and the cast takes no instructions. At this point you're seeing exactly how little gcc -O0
optimizes, because what you get is more dependent on how gcc internals are designed than on what makes sense or is efficient.
Compile with optimization enabled if you want to see interesting asm. See How to remove "noise" from GCC/clang assembly output?, especially the link to Matt Godbolt's CppCon2017 talk about looking at compiler output.
Although checking the asm without optimizations disabled is good, too for inline asm; you might not have realized the problem with using a q
override if it was just registers, although it is still a problem. Checking how it inlines into a few different callers at -O3
can be useful, too (especially if you test with some compile-time-constant inputs).
Your code is seriously broken
Besides the high-garbage problems discussed above, you modify input-operand registers without telling the compiler about it.
Fixing this by making some of them "+"
read/write outputs means your asm statement is no longer volatile
by default, so the compiler will optimize it away if the outputs are unused. (This includes after function inlining, so the return dest
is sufficient for the standalone version, but not after inlining if the caller ignores the return value.)
You did use a "memory"
clobber, so the compiler will assume that you read/write memory. You could tell it which memory you read and write, so it can optimize around your copy more efficiently. See get string length in inline GNU Assembler: you can use dummy memory input/output constraints like "m" (*(const char (*)[]) src)
char *asm_sub_str_fancyconstraints(char *dest, char *src, int s_idx, int e_idx) {
asm (
"addq %[s_idx], %%rsi; \n\t" /* Add start index to src (ptrs are 64-bit) */
"subl %k[s_idx], %%ecx; \n\t" /* Get length of substr as e - s (int is 32-bit) */
// the calling convention requires DF=0, and inline-asm can safely assume it, too
// (it's widely done, including in the Linux kernel)
//"cld;" /* Clear direction bit (force increment) */
"rep movsb; \n\t" /* Move %ecx bytes of str at %esi into str at %edi */
: [src]"+&S" (src), [dest]"+D" (dest), [e_idx]"+c" (e_idx)
, "=m" (*(char (*)[]) dest) // dummy output: all of dest
: [s_idx]"g" ((long long)s_idx)
, "m" (*(const char (*)[]) src) // dummy input: tell the compiler we read all of src[0..infinity]
: "cc"
);
return 0; // asm statement not optimized away, even without volatile,
// because of the memory output.
// Just like dest++; could optimize away, but *dest = 0; couldn't.
}
formatting: note the use of \n\t
at the end of each line for readability; otherwise the asm instructions are all on one line separated only by ;
. (It will assemble fine, but not very human-readable if you're checking how your asm template worked out.)
This compiles (with gcc -O3) to
asm_sub_str_fancyconstraints:
movslq %edx, %rdx # from the (long long)s_idx
xorl %eax, %eax # from the return 0, which I changed to test that it doesn't optimize away
addq %rdx, %rsi;
subl %edx, %ecx; # your code zero-extends (e_idx - s_idx)
rep movsb;
ret
I put this + a couple other versions on the Godbolt compiler explorer with gcc + clang. A simpler version fixes the bugs but still uses a "memory"
clobber + asm volatile
to get correctness with more compile-time optimization cost than this version that tells the compiler which memory is read and written.
Early clobber: Note the "+&S"
constraint:
If for some weird reason, the compiler knew that the src
address and s_idx
were equal, it could use the same register (esi/rsi
) for both inputs. This would lead to modifying s_idx
before it was used in the sub
. Declaring that the register holding src
is clobbered early (before all input registers are read for the last time) will force the compiler to choose different registers.
See the Godbolt link above for a caller that causes breakage without the &
for early-clobber. (But only with the nonsensical src = (char*)s_idx;
). Early-clobber declarations are often necessary for multi-instruction asm statements to prevent more realistic breakage possibilities, so definitely keep this in mind, and only leave it out when you're sure it's ok for any read-only input to share a register with an output or input/output operand. (Of course using specific-register constraints limits that possibility.)
I omitted the early-clobber declaration from e_idx
in ecx
, because the only "free" parameter is s_idx
, and putting them both in the same register will result in sub same,same
, and rep movsb
running 0 iterations as desired.
It would of course be more efficient to let the compiler do the math, and simply ask for the inputs to rep movsb
in the right registers. Especially if both e_idx
and s_idx
are compile-time constants, it's silly to force the compiler to mov
an immediate to a register and then subtract another immediate.
Or even better, don't use inline asm at all. (But if you really want rep movsb
to test its performance, inline asm is one way to do it. gcc also has tuning options that control how memcpy
inlines, if at all.)
No inline asm answer is complete without recommending that you https://gcc.gnu.org/wiki/DontUseInlineAsm if you can avoid it.
s_idx
) without telling the compiler. – Jester"r"
constraints, how would I even get around using the size overrides? Using just %2 in the first instruction won't compile because it needs a 64-bit register. As for the what the compiler should use when multiple sizes occur, I would think the largest size, just like assigning anint
to along
causes an implicit cast. – define cindy const%rsi
,%rdi
, or%rcx
to the clobber list causes compile errors because they are already inputs and should be assumed implicitly clobbered (if I understand the compile error correctly). And adding the early clobber constraint&
to the input constraints also won't compile. – define cindy constasm volatile( "cld;" "rep movsb;" : "=S"(src), "+D"(dest), "=c" (e_idx) : "0" (src+s_idx), "2" (e_idx-s_idx) : "cc", "memory" );
. Rather than compute the src and length in the inline template you can just get C to do it. It is possible to modify this in such a way to not need the "memory" constraint:asm ( "cld;" "rep movsb;" : "=S"(src), "+D"(dest), "=c" (e_idx), "=m"(*(char (*)[]) dest) : "0" (src+s_idx), "2" (e_idx-s_idx), "m"(*(const char (*)[]) src) : "cc" );
– Michael Petch