In many cases, the optimal way to perform a task may depend upon the context in which it is performed. If a routine is written in assembly language, the sequence of instructions generally cannot be varied based upon context. As a simple example, consider the following function:
inline void set_port_high(void)
{
    *((volatile unsigned char*)0x40001204) = 0xFF;
}
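For readers who want to experiment off-target, the same pattern can be exercised with a stand-in for the hardware register (on real hardware the pointer would target the memory-mapped register at 0x40001204; the variable here is purely a host-testable substitute):

```c
#include <stdint.h>

/* Host-testable analog: a plain volatile byte stands in for the
   memory-mapped port register, so the pattern can run on a PC. */
static volatile uint8_t fake_port;

static inline void set_port_high(void)
{
    *((volatile uint8_t *)&fake_port) = 0xFF;
}
```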
A compiler for 32-bit ARM code, given the above, would likely render it as something like:
ldr r0,=0x40001204
mov r1,#0xFF
strb r1,[r0]
[a fourth word somewhere holding the constant 0x40001204]
or perhaps
ldr r0,=0x40001000 ; Some compilers like to round pointer loads to multiples of 4096
mov r1,#0xFF
strb r1,[r0,#0x204]
[a fourth word somewhere holding the constant 0x40001000]
That could be optimized slightly in hand-assembled code, as either:
ldr r0,=0x400011FF
strb r0,[r0,#5]
[a third word somewhere holding the constant 0x400011FF]
or
mvn r0,#0xC0000000 ; Load with 0x3FFFFFFF
add r0,r0,#0x1200  ; Add 0x1200, yielding 0x400011FF
strb r0,[r0,#5]
Both of the hand-assembled approaches would require 12 bytes of code space rather than 16; the latter would replace a "load" with an "add", which on an ARM7-TDMI would execute two cycles faster. If the code were going to be executed in a context where r0 was don't-know/don't-care, the assembly-language versions would thus be somewhat better than the compiled version. On the other hand, suppose the compiler knew that some register [e.g. r5] was going to hold a value within 2047 bytes of the desired address 0x40001204 [e.g. 0x40001000], and further knew that some other register [e.g. r7] was going to hold a value whose low byte was 0xFF. In that case, the compiler could optimize the C version of the code to simply:
strb r7,[r5,#0x204]
Much shorter and faster than even the hand-optimized assembly code. Further, suppose set_port_high occurred in the context:
int temp = function1();
set_port_high();
function2(temp); // Assume temp is not used after this
Not at all implausible when coding for an embedded system. If set_port_high is written in assembly code, the compiler would have to move r0 (which holds the return value from function1) somewhere else before invoking the assembly code, and then move that value back to r0 afterward (since function2 will expect its first parameter in r0), so the "optimized" assembly code would need five instructions. Even if the compiler didn't know of any registers holding the address or the value to store, its four-instruction version (which it could adapt to use any available registers, not necessarily r0 and r1) would beat the "optimized" assembly-language version. And if the compiler had the necessary address and data in r5 and r7 as described earlier, and function1 did not alter those registers, it could replace set_port_high with a single strb instruction, four instructions smaller and faster than the "hand-optimized" assembly code.
Note that hand-optimized assembly code can often outperform a compiler in cases where the programmer knows the precise program flow, but compilers shine in cases where a piece of code is written before its context is known, or where one piece of source code may be invoked from multiple contexts [if set_port_high is used in fifty different places in the code, the compiler could independently decide for each of those how best to expand it].
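That per-call-site point can be sketched in C. The helper and addresses below are illustrative only (a stand-in array replaces real memory-mapped registers so the sketch can run anywhere): because the helper is inline, the compiler sees each call in its own context and can reduce each one independently, rather than being forced into one fixed register convention as an out-of-line assembly routine would be.

```c
#include <stdint.h>

/* Stand-in for a peripheral register block (hypothetical). */
static volatile uint8_t fake_regs[0x300];

/* Inline helper: the compiler may expand this differently at every
   call site -- folding constant offsets, reusing a base register it
   already holds, or emitting a single store instruction. */
static inline void write_port(uint32_t offset, uint8_t value)
{
    fake_regs[offset] = value;
}

static void init_ports(void)
{
    /* Two call sites with constant arguments: each can collapse to a
       single store with an immediate offset. */
    write_port(0x204, 0xFF);
    write_port(0x208, 0x00);
}
```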
In general, I would suggest that assembly language is apt to yield the greatest performance improvements in those cases where each piece of code can be approached from a very limited number of contexts, and is apt to be detrimental to performance in places where a piece of code may be approached from many different contexts. Interestingly (and conveniently), the cases where assembly is most beneficial to performance are often those where the code is most straightforward and easy to read. The places where assembly-language code would turn into a gooey mess are often those where writing in assembly would offer the smallest performance benefit.
[Minor note: there are some places where assembly code can be used to yield a hyper-optimized gooey mess; for example, one piece of code I did for the ARM needed to fetch a word from RAM and execute one of about twelve routines based upon the upper six bits of the value (many values mapped to the same routine). I think I optimized that code to something like:
ldrh r0,[r1],#2         ; Fetch halfword with post-increment
ldrb r1,[r8,r0,asr #10] ; Index the byte table by the top six bits
sub  pc,r8,r1,asl #2    ; Branch to table base minus 4*offset
The register r8 always held the address of the main dispatch table (within the loop where the code spent 98% of its time, nothing ever used it for any other purpose); all 64 entries referred to addresses in the 256 bytes preceding it. Since the primary loop had in most cases a hard execution-time limit of about 60 cycles, the nine-cycle fetch and dispatch was instrumental in meeting that goal. Using a table of 256 32-bit addresses would have been one cycle faster, but would have gobbled up 1KB of very precious RAM [flash would have added more than one wait state]. Using 64 32-bit addresses would have required adding an instruction to mask off some bits from the fetched word, and would still have gobbled up 192 more bytes than the table I actually used. Using the table of 8-bit offsets yielded very compact and fast code, but not something I would expect a compiler would ever come up with; nor would I expect a compiler to dedicate a register "full time" to holding the table address.
The above code was designed to run as a self-contained system; it could periodically call C code, but only at certain times when the hardware with which it was communicating could safely be put into an "idle" state for two roughly-one-millisecond intervals every 16ms.]
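For illustration, the shape of that dispatch can be modeled in C. The handler names and the mapping below are made up, and a function-pointer array replaces the branch-target byte offsets (which C cannot express portably), but the key ideas survive: a 64-entry table of one-byte values indexed by the top six bits of the fetched halfword, with many values mapping to the same routine.

```c
#include <stdint.h>

/* Hypothetical handlers; the real system had about twelve routines. */
static int handle_idle(void) { return 1; }
static int handle_data(void) { return 2; }

static int (*const handlers[])(void) = { handle_idle, handle_data };

/* 64 one-byte entries -- this is what keeps the table at 64 bytes
   instead of the 256 bytes (64 pointers) or 1KB (256 pointers)
   discussed above. */
static uint8_t dispatch_index[64];

static void init_dispatch(void)
{
    /* Arbitrary example mapping: lower half -> idle, upper -> data. */
    for (int i = 0; i < 64; i++)
        dispatch_index[i] = (i < 32) ? 0 : 1;
}

static int dispatch(uint16_t word)
{
    /* Top six bits of the 16-bit word select the table entry. */
    return handlers[dispatch_index[word >> 10]]();
}
```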