1
votes

I have a C program that calls a function pi_calcPiItem() 600000000 times through the function pi_calcPiBlock. So to analyze the time spent in the functions I used GNU gprof. The result seems to be erroneous since all calls are attributed to main() instead. Furthermore the call graph does not make any sense:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
 61.29      9.28     9.28                             pi_calcPiItem
 15.85     11.68     2.40                             pi_calcPiBlock
 11.96     13.49     1.81                             _mcount_private
  9.45     14.92     1.43                             __fentry__
  1.45     15.14     0.22                             pow
  0.00     15.14     0.00 600000000     0.00     0.00  main

                        Call graph


granularity: each sample hit covers 4 byte(s) for 0.07% of 15.14 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]     61.3    9.28    0.00                 pi_calcPiItem [1]
-----------------------------------------------
                                                 <spontaneous>
[2]     15.9    2.40    0.00                 pi_calcPiBlock [2]
                0.00    0.00 600000000/600000000     main [6]
-----------------------------------------------
                                                 <spontaneous>
[3]     12.0    1.81    0.00                 _mcount_private [3]
-----------------------------------------------
                                                 <spontaneous>
[4]      9.4    1.43    0.00                 __fentry__ [4]
-----------------------------------------------
                                                 <spontaneous>
[5]      1.5    0.22    0.00                 pow [5]
-----------------------------------------------
                                   6             main [6]
                0.00    0.00 600000000/600000000     pi_calcPiBlock [2]
[6]      0.0    0.00    0.00 600000000+6       main [6]
                                   6             main [6]
-----------------------------------------------

Is this a bug or do I have to configure the program somehow?

And what does <spontaneous> mean?

EDIT (more insight for you)

The code is all about the calculation of pi:

#define PI_BLOCKSIZE (100000000)
#define PI_BLOCKCOUNT (6)
#define PI_THRESHOLD (PI_BLOCKSIZE * PI_BLOCKCOUNT)

int32_t main(int32_t argc, char* argv[]) {
  double result;

  for ( int32_t i = 0; i < PI_THRESHOLD; i += PI_BLOCKSIZE ) {
    pi_calcPiBlock(&result, i, i + PI_BLOCKSIZE);
  }

  printf("pi = %f\n",result);
  return 0;
}

static void pi_calcPiBlock(double* result, int32_t start, int32_t end) {
  double piItem;

  for ( int32_t i = start; i < end; ++i ) {
    pi_calcPiItem(&piItem, i);
    *result += piItem;
  }  
}    

static void pi_calcPiItem(double* piItem, int32_t index) {
  *piItem = 4.0 * (pow(-1.0,index) / (2.0 * index + 1.0));
}

And this is how I got the results (executed on Windows with the help of Cygwin):

> gcc -std=c99 -o pi *.c -pg -fno-inline-small-functions
> ./pi.exe
> gprof.exe pi.exe
2
Code? What command line are you using to look at your results?CDahn
@CDahn I just added what you were asking for.Bastian
Function inlining is a standard C compiler optimization. You cannot get function call counts for functions that no longer exist.Hans Passant
@HansPassant Therefore I just used the -fno-inline-small-functions parameter to avoid inlining (and, hence, edited the description above). But: nothing changed in the output of gprof.Bastian
Double-check your assumptions with -S. There does get to be a point where tinkering with optimizer settings starts to produce useless profile results. You're getting well past it by disabling inlining.Hans Passant

2 Answers

2
votes

Try:

  1. Using the noinline, noclone function attributes instead of -fno-inline-small-functions
    • By disassembling main I could see that -fno-inline-small-functions doesn't stop inlining
  2. Linking your program statically (-static)
  3. You should also initialize result to 0.0 in main

This worked for me on Linux, x86-64:

#include <stdio.h>
#include <stdint.h>
#include <math.h>

#define PI_BLOCKSIZE (100000000)
#define PI_BLOCKCOUNT (6)
#define PI_THRESHOLD (PI_BLOCKSIZE * PI_BLOCKCOUNT)

static void pi_calcPiItem(double* piItem, int32_t index);
static void pi_calcPiBlock(double* result, int32_t start, int32_t end);

int32_t main(int32_t argc, char* argv[]) {
  double result;

  result = 0.0;
  for ( int32_t i = 0; i < PI_THRESHOLD; i += PI_BLOCKSIZE ) {
    pi_calcPiBlock(&result, i, i + PI_BLOCKSIZE);
  }

  printf("pi = %f\n",result);
  return 0;
}

__attribute__((noinline, noclone))
static void pi_calcPiBlock(double* result, int32_t start, int32_t end) {
  double piItem;

  for ( int32_t i = start; i < end; ++i ) {
    pi_calcPiItem(&piItem, i);
    *result += piItem;
  }  
}    

__attribute__((noinline, noclone))
static void pi_calcPiItem(double* piItem, int32_t index) {
  *piItem = 4.0 * (pow(-1.0,index) / (2.0 * index + 1.0));
}

Building the Code

$ cc pi.c -o pi -Os -Wall -g3 -I. -std=c99 -pg -static -lm

Output

$ ./pi && gprof ./pi
pi = 3.141593
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ns/call  ns/call  name    
 85.61     22.55    22.55                             __ieee754_pow_sse2
  4.75     23.80     1.25                             pow
  4.14     24.89     1.09 600000000     1.82     1.82  pi_calcPiItem
  2.54     25.56     0.67                             __exp1
  0.91     25.80     0.24                             pi_calcPiBlock
  0.53     25.94     0.14                             matherr
  0.47     26.07     0.13                             __lseek_nocancel
  0.38     26.17     0.10                             frame_dummy
  0.34     26.26     0.09                             __ieee754_exp_sse2
  0.32     26.34     0.09                             __profile_frequency
  0.00     26.34     0.00        1     0.00     0.00  main


             Call graph (explanation follows)


granularity: each sample hit covers 2 byte(s) for 0.04% of 26.34 seconds

index % time    self  children    called     name
                                                 <spontaneous>
[1]     85.6   22.55    0.00                 __ieee754_pow_sse2 [1]
-----------------------------------------------
                                                 <spontaneous>
[2]      5.0    0.24    1.09                 pi_calcPiBlock [2]
                1.09    0.00 600000000/600000000     pi_calcPiItem [4]
-----------------------------------------------
                                                 <spontaneous>
[3]      4.7    1.25    0.00                 pow [3]
-----------------------------------------------
                1.09    0.00 600000000/600000000     pi_calcPiBlock [2]
[4]      4.1    1.09    0.00 600000000         pi_calcPiItem [4]
-----------------------------------------------
                                                 <spontaneous>
[5]      2.5    0.67    0.00                 __exp1 [5]
-----------------------------------------------
                                                 <spontaneous>
[6]      0.5    0.14    0.00                 matherr [6]
-----------------------------------------------
                                                 <spontaneous>
[7]      0.5    0.13    0.00                 __lseek_nocancel [7]
-----------------------------------------------
                                                 <spontaneous>
[8]      0.4    0.10    0.00                 frame_dummy [8]
-----------------------------------------------
                                                 <spontaneous>
[9]      0.3    0.09    0.00                 __ieee754_exp_sse2 [9]
-----------------------------------------------
                                                 <spontaneous>
[10]     0.3    0.09    0.00                 __profile_frequency [10]
-----------------------------------------------
                0.00    0.00       1/1           __libc_start_main [827]
[11]     0.0    0.00    0.00       1         main [11]
-----------------------------------------------

Comments

As expected, pow() is the bottleneck. While pi is running, perf top (sampling based system profiler) also shows __ieee754_pow_sse2 taking 60%+ of CPU. Changing pow(-1.0,index) to ((i & 1) ? -1.0 : 1.0) as @Mike Dunlavey suggested makes the code roughly 4 times faster.

0
votes

In 'man gprof' page, here is explanation for "spontaneous":

Parents that are not themselves profiled will have the time of their profiled children propagated to them, but they will appear to be spontaneously invoked in the call graph listing, and will not have their time propagated further. Similarly, signal catchers, even though profiled, will appear to be spontaneous (although for more obscure reasons). Any profiled children of signal catchers should have their times propagated properly, unless the signal catcher was invoked during the execution of the profiling routine, in which case all is lost.