KCachegrind output for optimized vs unoptimized builds

Question

I run valgrind --tool=callgrind ./executable on the executable file generated by the following code:

#include <cstdlib>
#include <stdio.h>
using namespace std;

class XYZ{
public:
    int Count() const {return count;}
    void Count(int val){count = val;}
private:
    int count;
};

int main() {
    XYZ xyz;
    xyz.Count(10000);
    int sum = 0;
    for(int i = 0; i < xyz.Count(); i++){
//My interest is to see how the compiler optimizes the xyz.Count() call
        sum += i;
    }
    printf("Sum is %d\n", sum);
    return 0;
}

I make a debug build with the following options: -fPIC -fno-strict-aliasing -fexceptions -g -std=c++14. The release build is with the following options: -fPIC -fno-strict-aliasing -fexceptions -g -O2 -std=c++14.

Running valgrind generates two dump files. When these files (one file for debug executable, the other for release executable) are viewed in KCachegrind, the debug build is understandable as shown below:

As expected, the function XYZ::Count() const is called 10001 times. However, the optimized release build is much harder to decipher and it is not clear how many times the function is called at all. I am aware that the function call might be inlined. But how does one figure out that it has infact been inlined? The callgraph for the release build is as shown below:

There seems to be no indication of function XYZ::Count() const at all from main().

My questions are:

(1)Without looking at the assembly language code generated by the debug/release builds, and by using using KCachegrind, how can one figure out how many times a particular function, (in this case XYZ::Count() const) is called? In the release build call graph above, the function is not even called once.

(2)Is there a way to understand the callgraph and other details provided by KCachegrind for release/optimized builds? I have already looked at the KCachegrind manual available at https://docs.kde.org/trunk5/en/kdesdk/kcachegrind/kcachegrind.pdf, but I was wondering if there are some useful hacks/rules of thumb that one should look for in release builds.

ead ead · Accepted Answer · 2018-02-18T17:18:05

The output of valgrind is easy to understand: As valgrind+kcachegrind are telling you, this function was not called at all in the release build.

The question is, what do you mean by called? If a function is inlined, is it still "called"? Actually, the situation is more complex, as it seems at the first sight and your example isn't that trivial.

Was Count() inlined in the release build? Sure, kind of. The code transformation during the optimization is often quite remarkable, like in your case - and the best way to judge, is to look into the resulting assembler (here for clang):

main:                                   # @main
        pushq   %rax
        leaq    .L.str(%rip), %rdi
        movl    $49995000, %esi         # imm = 0x2FADCF8
        xorl    %eax, %eax
        callq   printf@PLT
        xorl    %eax, %eax
        popq    %rcx
        retq
.L.str:
        .asciz  "Sum is %d\n"

You can see, that the main doesn't execute the for-loop at all, but just prints the result (49995000), which is calculated during the optimization because the number of iterations is known during the compile-time.

So was Count() inlined? Yes, somewhere during the first steps of optimization, but then the code became something completely different - there is no place where Count() was inlined in the final assembler.

So what happens, when we "hide" the number of iteration from compiler? E.g. pass it via the command line:

...
int main(int argc,  char* argv[]) {
   XYZ xyz;
   xyz.Count(atoi(argv[1]));
...

In the resulting assembler, we still don't encounter a for-loop, because the optimizer can figure out, that the call of Count() doesn't have side-effect and optimizes the whole thing out:

main:                                   # @main
        pushq   %rbx
        movq    8(%rsi), %rdi
        xorl    %ebx, %ebx
        xorl    %esi, %esi
        movl    $10, %edx
        callq   strtol@PLT
        testl   %eax, %eax
        jle     .LBB0_2
        leal    -1(%rax), %ecx
        leal    -2(%rax), %edx
        imulq   %rcx, %rdx
        shrq    %rdx
        leal    -1(%rax,%rdx), %ebx
.LBB0_2:
        leaq    .L.str(%rip), %rdi
        xorl    %eax, %eax
        movl    %ebx, %esi
        callq   printf@PLT
        xorl    %eax, %eax
        popq    %rbx
        retq
.L.str:
        .asciz  "Sum is %d\n"

The optimizer came up with the formula (n-1)*(n-2)/2 for the sum i=0..n-1!

Let's now hide the definition of Count() in an separate translation unit class.cpp, so the optimizer cannot see it's definition:

class XYZ{
public:
    int Count() const;//definition in separate translation unit
...

Now we get our for-loop and a call to Count() in every iteration, the most important part of the assembler is:

.L6:
        addl    %ebx, %ebp
        addl    $1, %ebx
.L3:
        movq    %r12, %rdi
        call    XYZ::Count() const@PLT
        cmpl    %eax, %ebx
        jl      .L6

The result of the Count() (in %rax) is compared to the current counter (in %ebx) in every iteration step. Now, if we run it with valgrind we can see in the list of callees, that XYZ::Count() was called 10001 times.

However, for modern tool-chains it is not enough to see the assembler of the single translation units - there is a thing called link-time-optimization. We can use it by building somewhere along these lines:

gcc -fPIC -g -O2 -flto -o class.o -c class.cpp
gcc -fPIC -g -O2 -flto -o test.o  -c test.cpp
gcc -g -O2 -flto -o test_r class.o test.o

And running the resulting executable with valgrind we once again see, that Count() was not called!

However looking into the machine code (here I used gcc, my clang-installation seems to have an issue with lto):

00000000004004a0 <main>:
  4004a0:   48 83 ec 08             sub    $0x8,%rsp
  4004a4:   48 8b 7e 08             mov    0x8(%rsi),%rdi
  4004a8:   ba 0a 00 00 00          mov    $0xa,%edx
  4004ad:   31 f6                   xor    %esi,%esi
  4004af:   e8 bc ff ff ff          callq  400470 <strtol@plt>
  4004b4:   85 c0                   test   %eax,%eax
  4004b6:   7e 2b                   jle    4004e3 <main+0x43>
  4004b8:   89 c1                   mov    %eax,%ecx
  4004ba:   31 d2                   xor    %edx,%edx
  4004bc:   31 c0                   xor    %eax,%eax
  4004be:   66 90                   xchg   %ax,%ax
  4004c0:   01 c2                   add    %eax,%edx
  4004c2:   83 c0 01                add    $0x1,%eax
  4004c5:   39 c8                   cmp    %ecx,%eax
  4004c7:   75 f7                   jne    4004c0 <main+0x20>
  4004c9:   48 8d 35 a4 01 00 00    lea    0x1a4(%rip),%rsi        # 400674 <_IO_stdin_used+0x4>
  4004d0:   bf 01 00 00 00          mov    $0x1,%edi
  4004d5:   31 c0                   xor    %eax,%eax
  4004d7:   e8 a4 ff ff ff          callq  400480 <__printf_chk@plt>
  4004dc:   31 c0                   xor    %eax,%eax
  4004de:   48 83 c4 08             add    $0x8,%rsp
  4004e2:   c3                      retq   
  4004e3:   31 d2                   xor    %edx,%edx
  4004e5:   eb e2                   jmp    4004c9 <main+0x29>
  4004e7:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)

We can see, that the call to the function Count() was inlined but - there is still a for-loop (I guess this is a gcc vs clang thing).

But what is of most interest to you: the function Count() is "called" only once - its value is saved to register %ecx and the loop is actually only:

  4004c0:   01 c2                   add    %eax,%edx
  4004c2:   83 c0 01                add    $0x1,%eax
  4004c5:   39 c8                   cmp    %ecx,%eax
  4004c7:   75 f7                   jne    4004c0 <main+0x20>

This all you could also see with help of Kcachegrid, if valgrind were run with option `--dump-instr=yes.

KCachegrind output for optimized vs unoptimized builds

2 Answers