44
votes

Why does commenting out the first two lines of this for loop and uncommenting the third result in a 42% speedup?

int count = 0;
for (uint i = 0; i < 1000000000; ++i) {
    var isMultipleOf16 = i % 16 == 0;
    count += isMultipleOf16 ? 1 : 0;
    //count += i % 16 == 0 ? 1 : 0;
}

Behind the timing is vastly different assembly code: 13 vs. 7 instructions in the loop. The platform is Windows 7 running .NET 4.0 x64. Code optimization is enabled, and the test app was run outside VS2010. [Update: Repro project, useful for verifying project settings.]

Eliminating the intermediate boolean is a fundamental optimization, one of the simplest in my 1980s-era Dragon Book. How did the optimization not get applied when generating the CIL or JITing the x64 machine code?

Is there a "Really compiler, I would like you to optimize this code, please" switch? While I sympathize with the sentiment that premature optimization is akin to the love of money, I could see the frustration in trying to profile a complex algorithm that had problems like this scattered throughout its routines. You'd work through the hotspots but have no hint of the broader warm region that could be vastly improved by hand tweaking what we normally take for granted from the compiler. I sure hope I'm missing something here.
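For the record, the C# compiler does have an optimize switch (on by default for Release builds), and the JIT can back off optimizations when a debugger is attached. A sketch of building and running entirely from the command line, outside any debugger (file name is illustrative):

```shell
# /optimize+ enables compiler optimizations; /debug- omits debug info,
# which also avoids the JIT deoptimizing for debuggability.
csc /optimize+ /debug- /platform:x64 Program.cs

# Run outside Visual Studio so no debugger is attached.
Program.exe
```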

Update: Speed differences also occur for x86, but depend on the order that methods are just-in-time compiled. See Why does JIT order affect performance?

Assembly code (as requested):

    var isMultipleOf16 = i % 16 == 0;
00000037  mov         eax,edx 
00000039  and         eax,0Fh 
0000003c  xor         ecx,ecx 
0000003e  test        eax,eax 
00000040  sete        cl 
    count += isMultipleOf16 ? 1 : 0;
00000043  movzx       eax,cl 
00000046  test        eax,eax 
00000048  jne         0000000000000050 
0000004a  xor         eax,eax 
0000004c  jmp         0000000000000055 
0000004e  xchg        ax,ax 
00000050  mov         eax,1 
00000055  lea         r8d,[rbx+rax] 
    count += i % 16 == 0 ? 1 : 0;
00000037  mov         eax,ecx 
00000039  and         eax,0Fh 
0000003c  je          0000000000000042 
0000003e  xor         eax,eax 
00000040  jmp         0000000000000047 
00000042  mov         eax,1 
00000047  lea         edx,[rbx+rax] 
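Both listings above still contain a branch for the ternary. Purely as an illustration (this is my own sketch, not code from the post), the same count can be computed branchlessly by exploiting unsigned wrap-around: when `i & 15` is zero, subtracting 1 wraps to `0xFFFFFFFF`, so shifting right by 31 yields 1; for any value in 1..15 it yields 0.

```csharp
using System;

class BranchlessCount
{
    static void Main()
    {
        int count = 0;
        for (uint i = 0; i < 1000000000; ++i)
        {
            // (i & 15) == 0 exactly when i % 16 == 0, since 16 is a power of two.
            // 0u - 1u wraps to 0xFFFFFFFF, so the shift produces 1 for multiples
            // of 16 and 0 for everything else -- no branch required.
            count += (int)(((i & 15u) - 1u) >> 31);
        }
        Console.WriteLine(count); // 62500000 multiples of 16 below 1e9
    }
}
```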
5
I'd be curious to see the different assembly code. Could you post it? – phoog
Have you tested bool isMultipleOf16 = ...? – David.Chu.ca
@David.Chu.ca - that wouldn't make a difference - var is "compiler, please infer the type of this variable, and pretend I wrote that instead". In this case, it will have inferred bool for itself. – Damien_The_Unbeliever
@EdwardBrey: Since you did this in Debug mode, all bets are off. – BrokenGlass
@EdwardBrey: I can't find a source at the moment, but I believe the jitter and/or other optimizer settings are different if you have a debugger attached at all (that is, if you're running from Visual Studio, even if you compiled in "Release" mode). Try running your code from the command line (not from VS) and see what happens. – Daniel Pryden

5 Answers

9
votes

The question should really be "Why do I see such a difference on my machine?". I cannot reproduce such a huge speed difference and suspect there is something specific to your environment. It is very difficult to tell what that might be, though. It could be some (compiler) options you set some time ago and forgot about.

I created a console application, rebuilt it in Release mode (x86), and ran it outside VS. The results are virtually identical: 1.77 seconds for both methods. Here is the exact code:

static void Main(string[] args)
{
    Stopwatch sw = new Stopwatch();
    sw.Start();
    int count = 0;

    for (uint i = 0; i < 1000000000; ++i)
    {
        // 1st method
        var isMultipleOf16 = i % 16 == 0;
        count += isMultipleOf16 ? 1 : 0;

        // 2nd method
        //count += i % 16 == 0 ? 1 : 0;
    }

    sw.Stop();
    Console.WriteLine("Elapsed {0}, count {1}", sw.Elapsed, count);
    Console.ReadKey();
}

Please, anyone who has 5 minutes: copy the code, rebuild, run it outside VS, and post your results in comments to this answer. I'd like to avoid saying "it works on my machine".
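If you do post results, it may help to print the runtime environment along with the timing so the numbers are comparable across machines. A minimal sketch using standard .NET APIs (the report format is my own):

```csharp
using System;

class EnvironmentReport
{
    static void Main()
    {
        // IntPtr.Size is 8 under a 64-bit runtime, 4 under 32-bit.
        Console.WriteLine("CLR version: {0}", Environment.Version);
        Console.WriteLine("Process bitness: {0}-bit", IntPtr.Size * 8);
        // An attached debugger can suppress JIT optimizations.
        Console.WriteLine("Debugger attached: {0}",
            System.Diagnostics.Debugger.IsAttached);
    }
}
```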

EDIT

To be sure, I created a 64-bit WinForms application, and the results are similar to those in the question: the first method is slower (1.57 sec) than the second one (1.05 sec). The difference I observe is 33%, which is still a lot. It seems there is a bug in the .NET 4 64-bit JIT compiler.

4
votes

I can't speak to the .NET compiler, its optimizations, or even when it performs them.

But in this specific case, if the compiler folded that boolean variable into the actual statement and you then tried to debug this code, the optimized code would not match the code as written. You would not be able to single-step over the isMultipleOf16 assignment and check its value.

That's just one example of where the optimization may well be turned off. There could be others; the optimization may happen during the load phase of the code rather than during code generation by the CLR.

Modern runtimes are pretty complicated, especially once you throw in the JIT and dynamic optimization at run time. Sometimes I feel grateful that the code does what it says at all.

3
votes

It's a bug in the .NET Framework.

Well, really I'm just speculating, but I submitted a bug report on Microsoft Connect to see what they would say. After Microsoft deleted that report, I resubmitted it on the roslyn project on GitHub.

Update: Microsoft has moved the issue to the coreclr project. From the comments on the issue, calling it a bug seems a bit strong; it's more of a missing optimization.

2
votes

I think this is related to your other question. When I change your code as follows, the multi-line version wins.

Oops: only on x86. On x64, the multi-line version is the slowest, and the conditional beats them both handily.

class Program
{
    static void Main()
    {
        ConditionalTest();
        SingleLineTest();
        MultiLineTest();
        ConditionalTest();
        SingleLineTest();
        MultiLineTest();
        ConditionalTest();
        SingleLineTest();
        MultiLineTest();
    }

    public static void ConditionalTest()
    {
        Stopwatch stopwatch = new Stopwatch();
        stopwatch.Start();
        int count = 0;
        for (uint i = 0; i < 1000000000; ++i) {
            if (i % 16 == 0) ++count;
        }
        stopwatch.Stop();
        Console.WriteLine("Conditional test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
    }

    public static void SingleLineTest()
    {
        Stopwatch stopwatch = new Stopwatch();
        stopwatch.Start();
        int count = 0;
        for (uint i = 0; i < 1000000000; ++i) {
            count += i % 16 == 0 ? 1 : 0;
        }
        stopwatch.Stop();
        Console.WriteLine("Single-line test --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
    }

    public static void MultiLineTest()
    {
        Stopwatch stopwatch = new Stopwatch();
        stopwatch.Start();
        int count = 0;
        for (uint i = 0; i < 1000000000; ++i) {
            var isMultipleOf16 = i % 16 == 0;
            count += isMultipleOf16 ? 1 : 0;
        }
        stopwatch.Stop();
        Console.WriteLine("Multi-line test  --> Count: {0}, Time: {1}", count, stopwatch.ElapsedMilliseconds);
    }
}
1
votes

I tend to think of it like this: the people who work on the compiler can only do so much stuff per year. If in that time they could implement lambdas or lots of classical optimizations, I'd vote for lambdas. C# is a language that's efficient in terms of code reading and writing effort, rather than in terms of execution time.

So it's reasonable for the team to concentrate on features that maximize the reading/writing efficiency, rather than the execution efficiency in a certain corner case (of which there are probably thousands).

Initially, I believe, the idea was that the JITter would do all the optimization. Unfortunately the JITting takes noticeable amounts of time, and any advanced optimizations will make it worse. So that didn't work out as well as one might have hoped.

One thing I have found about writing really fast code in C# is that quite often you hit a severe GC bottleneck before any optimization like the one you mention would make a difference - for example, if you allocate millions of objects. C# leaves you very little room to avoid that cost: you can use arrays of structs instead, but the resulting code is really ugly in comparison. My point is that many other decisions about C# and .NET make such specific optimizations less worthwhile than they would be in something like a C++ compiler. Heck, they even dropped the CPU-specific optimizations in NGen, trading performance for programmer (debugger) efficiency.
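To illustrate the structs-versus-objects trade-off mentioned above, here is a hypothetical sketch (the type names are mine): an array of reference-type points means one heap object per element for the GC to track, while an array of value-type points is a single contiguous allocation.

```csharp
using System;

class PointClass { public int X, Y; }    // reference type: one heap object per instance
struct PointStruct { public int X, Y; }  // value type: stored inline in the array

class GcPressureDemo
{
    static void Main()
    {
        const int n = 1000000;

        // One million separate heap objects, each tracked by the GC.
        var classes = new PointClass[n];
        for (int i = 0; i < n; i++) classes[i] = new PointClass { X = i, Y = i };

        // One contiguous allocation; no per-element objects, but clunkier code.
        var structs = new PointStruct[n];
        for (int i = 0; i < n; i++) { structs[i].X = i; structs[i].Y = i; }

        Console.WriteLine("classes[42].X = {0}, structs[42].X = {1}",
            classes[42].X, structs[42].X);
    }
}
```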

Having said all this, I'd love a C# that actually made use of the optimizations C++ has made use of since the 1990s. Just not at the expense of features like, say, async/await.