In many cases, the optimal way to perform a task may depend upon the context in which it is performed. If a routine is written in assembly language, the sequence of instructions generally cannot be varied based upon context. As a simple example, consider the following function:
inline void set_port_high(void)
{
    *((volatile unsigned char*)0x40001204) = 0xFF;
}
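For readers who want to experiment off-target, the same pattern can be exercised with a stand-in for the hardware register (on real hardware the pointer would target the memory-mapped register at 0x40001204; the variable here is purely a host-testable substitute):

```c
#include <stdint.h>

/* Host-testable analog: a plain volatile byte stands in for the
   memory-mapped port register, so the pattern can run on a PC. */
static volatile uint8_t fake_port;

static inline void set_port_high(void)
{
    *((volatile uint8_t *)&fake_port) = 0xFF;
}
```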
A compiler for 32-bit ARM code, given the above, would likely render it as something like:
ldr r0,=0x40001204
mov r1,#0xFF
strb r1,[r0]
[a fourth word somewhere holding the constant 0x40001204]
or perhaps
ldr r0,=0x40001000 ; Some compilers like to round pointer loads to multiples of 4096
mov r1,#0xFF
strb r1,[r0,#0x204]
[a fourth word somewhere holding the constant 0x40001000]
That could be optimized slightly in hand-assembled code, as either:
ldr r0,=0x400011FF
strb r0,[r0,#5]
[a third word somewhere holding the constant 0x400011FF]
or
mvn r0,#0xC0000000 ; Load with 0x3FFFFFFF
add r0,r0,#0x1200  ; Add 0x1200, yielding 0x400011FF
strb r0,[r0,#5]
Both of the hand-assembled approaches would require 12 bytes of code space rather than 16; the latter would replace a "load" with an "add", which on an ARM7-TDMI would execute two cycles faster. If the code were going to be executed in a context where r0 was don't-know/don't-care, the assembly-language versions would thus be somewhat better than the compiled version. On the other hand, suppose the compiler knew that some register [e.g. r5] was going to hold a value within 2047 bytes of the desired address 0x40001204 [e.g. 0x40001000], and further knew that some other register [e.g. r7] was going to hold a value whose low byte was 0xFF. In that case, the compiler could optimize the C version of the code to simply:
strb r7,[r5,#0x204]
Much shorter and faster than even the hand-optimized assembly code. Further, suppose set_port_high occurred in the context:
int temp = function1();
set_port_high();
function2(temp); // Assume temp is not used after this
Not at all implausible when coding for an embedded system. If set_port_high is written in assembly code, the compiler would have to move r0 (which holds the return value from function1) somewhere else before invoking the assembly code, and then move that value back to r0 afterward (since function2 will expect its first parameter in r0), so the "optimized" assembly code would need five instructions. Even if the compiler didn't know of any registers holding the address or the value to store, its four-instruction version (which it could adapt to use any available registers, not necessarily r0 and r1) would beat the "optimized" assembly-language version. And if the compiler had the necessary address and data in r5 and r7 as described earlier, and function1 did not alter those registers, it could replace set_port_high with a single strb instruction, four instructions smaller and faster than the "hand-optimized" assembly code.
Note that hand-optimized assembly code can often outperform a compiler in cases where the programmer knows the precise program flow, but compilers shine in cases where a piece of code is written before its context is known, or where one piece of source code may be invoked from multiple contexts [if set_port_high is used in fifty different places in the code, the compiler could independently decide for each of those how best to expand it].
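That per-call-site point can be sketched in C. The helper and addresses below are illustrative only (a stand-in array replaces real memory-mapped registers so the sketch can run anywhere): because the helper is inline, the compiler sees each call in its own context and can reduce each one independently, rather than being forced into one fixed register convention as an out-of-line assembly routine would be.

```c
#include <stdint.h>

/* Stand-in for a peripheral register block (hypothetical). */
static volatile uint8_t fake_regs[0x300];

/* Inline helper: the compiler may expand this differently at every
   call site -- folding constant offsets, reusing a base register it
   already holds, or emitting a single store instruction. */
static inline void write_port(uint32_t offset, uint8_t value)
{
    fake_regs[offset] = value;
}

static void init_ports(void)
{
    /* Two call sites with constant arguments: each can collapse to a
       single store with an immediate offset. */
    write_port(0x204, 0xFF);
    write_port(0x208, 0x00);
}
```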
In general, I would suggest that assembly language is apt to yield the greatest performance improvements in those cases where each piece of code can be approached from a very limited number of contexts, and is apt to be detrimental to performance in places where a piece of code may be approached from many different contexts. Interestingly (and conveniently), the cases where assembly is most beneficial to performance are often those where the code is most straightforward and easy to read. The places where assembly-language code would turn into a gooey mess are often those where writing in assembly would offer the smallest performance benefit.
[Minor note: there are some places where assembly code can be used to yield a hyper-optimized gooey mess; for example, one piece of code I did for the ARM needed to fetch a word from RAM and execute one of about twelve routines based upon the upper six bits of the value (many values mapped to the same routine). I think I optimized that code to something like:
ldrh r0,[r1],#2         ; Fetch halfword with post-increment
ldrb r1,[r8,r0,asr #10] ; Index the byte table by the top six bits
sub  pc,r8,r1,asl #2    ; Branch to table base minus 4*offset
The register r8 always held the address of the main dispatch table (within the loop where the code spent 98% of its time, nothing ever used it for any other purpose); all 64 entries referred to addresses in the 256 bytes preceding it. Since the primary loop had in most cases a hard execution-time limit of about 60 cycles, the nine-cycle fetch and dispatch was instrumental in meeting that goal. Using a table of 256 32-bit addresses would have been one cycle faster, but would have gobbled up 1KB of very precious RAM [flash would have added more than one wait state]. Using 64 32-bit addresses would have required adding an instruction to mask off some bits from the fetched word, and would still have gobbled up 192 more bytes than the table I actually used. Using the table of 8-bit offsets yielded very compact and fast code, but not something I would expect a compiler would ever come up with; nor would I expect a compiler to dedicate a register "full time" to holding the table address.
The above code was designed to run as a self-contained system; it could periodically call C code, but only at certain times when the hardware with which it was communicating could safely be put into an "idle" state for two roughly-one-millisecond intervals every 16ms.]
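For illustration, the shape of that dispatch can be modeled in C. The handler names and the mapping below are made up, and a function-pointer array replaces the branch-target byte offsets (which C cannot express portably), but the key ideas survive: a 64-entry table of one-byte values indexed by the top six bits of the fetched halfword, with many values mapping to the same routine.

```c
#include <stdint.h>

/* Hypothetical handlers; the real system had about twelve routines. */
static int handle_idle(void) { return 1; }
static int handle_data(void) { return 2; }

static int (*const handlers[])(void) = { handle_idle, handle_data };

/* 64 one-byte entries -- this is what keeps the table at 64 bytes
   instead of the 256 bytes (64 pointers) or 1KB (256 pointers)
   discussed above. */
static uint8_t dispatch_index[64];

static void init_dispatch(void)
{
    /* Arbitrary example mapping: lower half -> idle, upper -> data. */
    for (int i = 0; i < 64; i++)
        dispatch_index[i] = (i < 32) ? 0 : 1;
}

static int dispatch(uint16_t word)
{
    /* Top six bits of the 16-bit word select the table entry. */
    return handlers[dispatch_index[word >> 10]]();
}
```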