I was engaged in a discussion with an expert who allegedly has vastly superior coding skills to mine and who understands inline assembly far better than I ever could.
One of the claims is that as long as an operand appears as an input constraint, you don't need to list it as a clobber or otherwise specify that the register has potentially been modified by the inline assembly. The conversation came about when someone else was trying to get assistance with a memset implementation that was effectively coded this way:
    void *memset(void *dest, int value, size_t count)
    {
        asm volatile ("cld; rep stosb" :: "D"(dest), "c"(count), "a"(value));
        return dest;
    }
When I commented on the problem of clobbering registers without telling the compiler, the expert's response was:
"c"(count) already tells the compiler c is clobbered
I found an example in the expert's own operating system where they write similar code following the same design pattern. They use Intel syntax for their inline assembly. This hobby operating system code runs in a kernel (ring 0) context. An example is this buffer swap function[1]:
    void swap_vbufs(void) {
        asm volatile (
            "1: "
            "lodsd;"
            "cmp eax, dword ptr ds:[rbx];"
            "jne 2f;"
            "add rdi, 4;"
            "jmp 3f;"
            "2: "
            "stosd;"
            "3: "
            "add rbx, 4;"
            "dec rcx;"
            "jnz 1b;"
            :
            : "S" (antibuffer0),
              "D" (framebuffer),
              "b" (antibuffer1),
              "c" ((vbe_pitch / sizeof(uint32_t)) * vbe_height)
            : "rax"
        );
        return;
    }
antibuffer0, antibuffer1, and framebuffer are all buffers in memory treated as arrays of uint32_t. framebuffer is actual video memory (MMIO), while antibuffer0 and antibuffer1 are buffers allocated in ordinary memory.
The global variables are properly set up before this function is called. They are declared as:
    volatile uint32_t *framebuffer;
    volatile uint32_t *antibuffer0;
    volatile uint32_t *antibuffer1;
    int vbe_width = 1024;
    int vbe_height = 768;
    int vbe_pitch;
My Questions and Concerns about this Kind of Code
As an apparent neophyte to inline assembly with an apparently naive understanding of the subject, I'm wondering whether my apparently uneducated belief that this code is potentially very buggy is correct. I want to know if these concerns have any merit:
- RDI, RSI, RBX, and RCX are all modified by this code. RDI and RSI are incremented implicitly by LODSD and STOSD. The rest are modified explicitly with "add rbx, 4;" and "dec rcx;". None of these registers are listed as input/output operands, nor are they listed as output operands. I believe these constraints need to be changed to inform the compiler that these registers may have been modified/clobbered. The only register listed as clobbered, which I believe is correct, is RAX. Is my understanding correct? My feeling is that RDI, RSI, RBX, and RCX should be input/output constraints (using the + modifier). Even if one tries to argue that the 64-bit System V ABI calling convention will save them (an assumption that is, IMHO, a poor way to write such code), RBX is a non-volatile register that will change in this code.
- Since the addresses are passed via registers (and not memory constraints), I believe it is a potential bug that the compiler hasn't been told that the memory these pointers point at has been read and/or modified. Is my understanding correct?
- RBX and RCX are hard-coded registers. Wouldn't it make sense to allow the compiler to choose these registers automatically via the constraints?
- If one assumes that inline assembly has to be used here (hypothetically), what would bug-free GCC inline assembly code look like for this function? Is this function fine as is, and I just don't understand the basics of GCC's extended inline assembly like the expert does?
Footnotes
- [1] The swap_vbufs function and associated variable declarations have been reproduced verbatim without the copyright holder's permission under fair use for purposes of commentary about a larger body of work.
swap_vbufs doesn't look optimal. I think it's just a copy-if-different. I guess if diffs are rare, the branch misses here might be worth avoiding MMIO stores, but a rep movsd or movnti unconditional streaming store to video RAM might be fine. Either way, it's not the most efficient implementation of the loop; using stosd makes it worse because it requires an else add rdi,4 branch with an extra jmp on the (hopefully) fast path. – Peter Cordes
[...] asm when I first discovered it. But over time, I came to understand that despite its power, it's almost always a bad idea. It gives interesting insight into how the compiler works, but I'd think long and hard before using it in production code. That said, I mentally add 1 to my karma every time it gets referenced... – David Wohlferd