0 votes

I'm learning how to profile my code with gprof. For one of my applications, I have the following output:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 10.27      1.20     1.20                             Location::get_type() const (location.cpp:20 @ 40a4bd)

Farther down I see this

  1.20      4.98     0.14 34662692     0.00     0.00  Location::get_type() const (location.cpp:19 @ 40a4ac)

Here is the function

char Location::get_type() const {
    return type;
}

I'm assuming the first line from gprof refers to the total time the function needs to execute, while the second line refers to just the time spent on the return statement. I have other getters in the same class that return ints, but for those the difference between the function time and the return-statement time is only about 0.1 seconds, whereas for the times I posted the difference is 1.06 seconds (the other getters are called about 2 million fewer times, which is small compared to the total number of calls). What could explain the function costing so much more than the one line of code inside it?

It might be worth mentioning that I compiled with -g -pg since I'm using gprof in line-by-line mode.

Edit: One of the answers suggested I look at the assembly output. I can't make sense of it, so I'll post it here. Below is the assembly for two functions. The first is get_floor(), which is relatively fast (~0.10 seconds). The second is get_type(), which is slow.

_ZNK8Location9get_floorEv:
.LFB5:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        movq    %rdi, -8(%rbp)
        movq    -8(%rbp), %rax
        movl    8(%rax), %eax
        popq    %rbp
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
.LFE5:
        .size   _ZNK8Location9get_floorEv, .-_ZNK8Location9get_floorEv
        .align 2
        .globl  _ZNK8Location8get_typeEv
        .type   _ZNK8Location8get_typeEv, @function
_ZNK8Location8get_typeEv:
.LFB6:
        .cfi_startproc
        pushq   %rbp
        .cfi_def_cfa_offset 16
        .cfi_offset 6, -16
        movq    %rsp, %rbp
        .cfi_def_cfa_register 6
        movq    %rdi, -8(%rbp)
        movq    -8(%rbp), %rax
        movzbl  12(%rax), %eax
        popq    %rbp
        .cfi_def_cfa 7, 8
        ret
        .cfi_endproc
3 Comments

Is get_type a virtual function? – casablanca

One possibility is that there's a granularity problem; the profiling mechanism runs for 0.01 (1/100th) second, then checks where the program counter is, and assumes you're in that function. However, if the functions are small, I believe you can end up with the wrong function being credited, because the profiler treats a stretch of 16 or 32 bytes of machine code as all belonging to the same function. – Jonathan Leffler

You're getting good answers, but I'm curious to know what makes you focus on gprof. Did a teacher recommend it? Did a blog or a book or some documentation recommend it? It has a number of issues. – Mike Dunlavey

@MikeDunlavey It was one of the first that I saw, and it's upvoted fairly high here: stackoverflow.com/questions/375913/…. I didn't realize it had so many issues; I'll try out another profiler and come back to this question. – gsingh2011

3 Answers

1 vote

Profilers based on code instrumentation are almost useless for very fast functions like this one; anything you get out of them is little better than random numbers.

The reason is that execution time on modern CPUs is a very complex function of a large number of parameters, and the instrumentation (code added to the function to update statistics) messes everything up. For a short function like this one, the instrumentation alone probably costs far more than the code being instrumented.

For profiling fast functions you should use a passive (sampling) profiler like OProfile, which just runs the program and checks at quasi-regular intervals where it is. You don't get exact counts, but the approximation you get from random sampling is much closer to the real thing, because the program's behavior has barely been altered.

Another way to see where a program is spending its time is simply to experiment. If you suspect that a certain part of the code is the bottleneck, try executing it 10 times (this is often easy to do) and check how much the total execution time of the program changes. If the slowdown is small, you know that even removing that code completely wouldn't save you much.

0 votes

The first thing that pops into my head is that a function call is not free: you have to set up and tear down the stack frame each time you call it.

That may be the source of your discrepancy; you should examine the assembler output (e.g. with gcc -S) to see what the underlying code is.

0 votes

If you want to measure the temperature of the air inside a 1 cubic-centimeter box and you place a thermometer with a 0.8 cm³ head inside it, you're not measuring the temperature of the air but that of the thermometer.

Your function is so small that the profiler is, in effect, measuring its own overhead.