9
votes

Is there an easy way to quickly count the number of instructions executed (x86 instructions: which ones, and how many of each) while executing a C program?

I use gcc version 4.7.1 (GCC) on an x86_64 GNU/Linux machine.

5
I agree with Doness' answer that typically people want to profile execution time per function. However, if you really want to get exact counts of each instruction executed, then you need to run your code on an instruction set simulator, such as simplescalar.com. - TJD
Can you elaborate on what you are trying to accomplish? On x86, instruction execution performance depends far, far more on context than it does on the actual instruction -- virtually all instructions can optionally be loads or stores, for example. And purely register-to-register instructions are going to depend in complex ways on the pipeline state on modern CPUs. This doesn't sound like useful information to me. - Andy Ross
Why do you ask? Usually profiling means something different... E.g. compile with gcc -pg -Wall -O and use gprof, or perhaps oprofile! - Basile Starynkevitch
I am implementing a complex mathematical algorithm and I wanted to count the number of multiplications (and divisions) that happen during its execution. I was looking for an easy way other than looking at the high-level code and inferring the numbers. Maybe I should use a custom multiply function and insert a counter in it. - Jean
I'm not sure I believe "zero wait memory", even L1 cache on modern CPUs is 4 cycles! But regardless: look to tricks like building your app in C++ using a custom operator*() implementation. Note that on modern compilers even "multiplication" may not be implemented in an easy to detect way (consider the classic tricks played with the LEA instruction). - Andy Ross
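
The "custom multiply function with a counter" idea from the comments above could look roughly like the following minimal C sketch. The names mul_count, counted_mul and dot are hypothetical and chosen only for illustration. It counts multiplications as written in the source, not the machine instructions the compiler ends up emitting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
static uint64_t mul_count = 0;  /* multiplications performed so far */

/* Multiply two values and record that a multiplication happened. */
static double counted_mul(double a, double b)
{
    ++mul_count;
    return a * b;
}

/* Example algorithm: dot product of two length-n vectors, with every
 * multiplication routed through counted_mul(). */
static double dot(const double *x, const double *y, size_t n)
{
    assert(x != NULL && y != NULL);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += counted_mul(x[i], y[i]);
    return sum;
}
```

Calling dot() on two three-element arrays raises mul_count by 3. The counter is a plain global, so this sketch is not thread-safe.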

5 Answers

5
votes

Linux perf_event_open system call with config = PERF_COUNT_HW_INSTRUCTIONS

This Linux system call appears to be a cross-architecture wrapper for performance events, covering both hardware performance counters from the CPU and software events from the kernel.

Here's an example adapted from the perf_event_open man page:

perf_event_open.c

#define _GNU_SOURCE
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#include <inttypes.h>
#include <sys/types.h>

static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    int ret;

    ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                    group_fd, flags);
    return ret;
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;

    uint64_t n;
    if (argc > 1) {
        n = strtoll(argv[1], NULL, 0);
    } else {
        n = 10000;
    }

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    // Don't count hypervisor events.
    pe.exclude_hv = 1;

    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening leader %llx\n", pe.config);
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

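    /* Loop n times (n must be nonzero). volatile keeps the compiler
     * from dropping the loop if optimizations are enabled. */
    __asm__ volatile (
        "1:;\n"
        "sub $1, %[n];\n"
        "jne 1b;\n"
        : [n] "+r" (n)
        :
        : "cc"
    );

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    /* A short or failed read would leave count uninitialized. */
    if (read(fd, &count, sizeof(long long)) != (ssize_t)sizeof(long long)) {
        fprintf(stderr, "Error reading counter\n");
        exit(EXIT_FAILURE);
    }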

    printf("Used %lld instructions\n", count);

    close(fd);
}

Compile and run:

g++ -ggdb3 -O0 -std=c++11 -Wall -Wextra -pedantic -o perf_event_open.out perf_event_open.c
./perf_event_open.out

Output:

Used 20016 instructions

So we see that the result is pretty close to the expected value of 20000: 10k * two instructions per loop in the __asm__ block (sub, jne).

If I vary the argument, even to low values such as 100:

./perf_event_open.out 100

it gives:

Used 216 instructions

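The constant overhead of 16 instructions stays the same, which suggests the count is quite accurate. Those 16 are presumably the user-space instructions that run between enabling and disabling the counter but outside the loop itself, such as the return from the first ioctl and the setup of the second.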

Tested on Ubuntu 20.04 amd64, GCC 9.3.0, Linux kernel 5.4.0, Intel Core i7-7820HQ CPU.

2
votes

Probably a duplicate of this question

I say probably because you asked for the assembler instructions, but that question handles the C-level profiling of code.

My question to you would be, however: why would you want to profile the actual machine instructions executed? As a very first issue, this would differ between various compilers, and their optimization settings. As a more practical issue, what could you actually DO with that information? If you are in the process of searching for/optimizing bottlenecks, the code profiler is what you are looking for.

I might be missing something important here, though.

2
votes

You can easily count the number of executed instructions using the Hardware Performance Counters (HPC). To access the HPC, you need an interface to them. I recommend using PAPI, the Performance API.

2
votes

Intel Pin's instcount

You can use Intel's binary instrumentation tool 'Pin'. I would avoid using a simulator (they are often extremely slow). Pin does most of what you can do with a simulator, without recompiling the binary and at close to normal execution speed (depending on the Pin tool you are using).

To count the number of instructions with Pin:

  1. Download the latest (or 3.10 if this answer gets old) pin kit from here.
  2. Extract everything and go to the directory: cd pin-root/source/tools/ManualExample/
  3. Make all the tools in the directory: make all
  4. Run the tool called inscount0.so using the command: ../../../pin -t obj-intel64/inscount0.so -- your-binary-here
  5. Read the instruction count from the file inscount.out: cat inscount.out

The output would be something like:

➜ ../../../pin -t obj-intel64/inscount0.so -- /bin/ls
buffer_linux.cpp       itrace.cpp
buffer_windows.cpp     little_malloc.c
countreps.cpp          makefile
detach.cpp             makefile.rules
divide_by_zero_unix.c  malloc_mt.cpp
isampling.cpp          w_malloctrace.cpp
➜ cat inscount.out
Count 716372

1
votes

Although not "quick" depending on the program, this may have been answered in this question. There, Mark Plotnick suggests using gdb to single-step the program and count the steps until the program counter reaches a chosen address:

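# instructioncount.gdb
set pagination off
# Start the program and stop at main, so that $pc is valid and counting
# begins at the same point where stopaddress.gdb reads the return address.
break main
run
set $count=0
while ($pc != 0xyourstoppingaddress)
    stepi
    set $count++
end
print $count
quit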

Then, start gdb on your program:

gdb --batch --command instructioncount.gdb --args ./yourexecutable with its arguments

To get the end address 0xyourstoppingaddress, you can use the following script:

# stopaddress.gdb
break main
run
info frame
quit

which puts a breakpoint on the function main, and gives:

$ gdb --batch --command stopaddress.gdb --args ./yourexecutable with its arguments
...
Stack level 0, frame at 0x7fffffffdf70:
 rip = 0x40089d in main (main_aes.c:33); saved rip 0x7ffff7a66d20
 source language c.
 Arglist at 0x7fffffffdf60, args: argc=3, argv=0x7fffffffe048
...

What matters here is the saved rip 0x7ffff7a66d20 part. On x86-64, rip is the instruction pointer, and the saved rip is the "return address", as stated by pepero in this answer.

So in this case, the stopping address is 0x7ffff7a66d20, which is the return address of the main function. Stepping until $pc reaches it counts the instructions executed from the breakpoint in main until main returns to its caller.
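
The same single-stepping idea can also be sketched in C without gdb, using the Linux ptrace system call directly. Note that this is a swapped-in technique rather than the gdb script above: count_steps is a hypothetical helper, it steps the whole process from exec to exit (including the dynamic loader and libc startup), so its number will be much larger than a count that starts at main, and it does not forward signals to the child.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: runs argv[0] with arguments argv under ptrace,
 * single-stepping it from exec until it exits.
 * Returns the number of single steps taken, or -1 on error.
 * If tracing is not permitted, the child simply runs to completion
 * and the function returns 0. */
static long long count_steps(char *const argv[])
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: ask to be traced, then replace ourselves with the target.
         * The kernel stops a traced child with SIGTRAP right after exec. */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execv(argv[0], argv);
        _exit(127);  /* only reached if exec failed */
    }

    int status;
    long long steps = 0;

    if (waitpid(pid, &status, 0) == -1)
        return -1;

    /* Each PTRACE_SINGLESTEP lets the child execute one instruction and
     * then stops it again; the loop ends when the child has exited. */
    while (WIFSTOPPED(status)) {
        if (ptrace(PTRACE_SINGLESTEP, pid, NULL, NULL) == -1 ||
            waitpid(pid, &status, 0) == -1) {
            kill(pid, SIGKILL);       /* do not leave a stopped child behind */
            waitpid(pid, &status, 0);
            return -1;
        }
        ++steps;
    }
    return steps;
}
```

As with the gdb approach, this is slow: every instruction costs two system calls and a context switch, so it is only practical for short runs.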