Quick way to count number of instructions executed in a C program

Question

Is there an easy way to quickly count the number of instructions executed (x86 instructions - which and how many each) while executing a C program ?

I use gcc version 4.7.1 (GCC) on a x86_64 GNU/Linux machine.

I agree with Doness' answer that typically people want to profile execution time per function. However, if you really want to get exact counts of each instruction executed, then you need to run your code on an instruction set simulator, such as simplescalar.com — TJD
Can you elaborate on what you are trying to accomplish? On x86, instruction execution performance depends far, far more on context than it does on the actual instruction -- virtually all instructions can optionally be loads or stores, for example. And purely register-to-register instructions are going to depend in complex ways on the pipeline state on modern CPUs. This doesn't sound like useful information to me. — Andy Ross
Why do you ask? Usually profiling means something different... Eg compile with gcc -pg -Wall -O and use gprof or perhaps oprofile !! — Basile Starynkevitch
I am implementing a complex mathematical algorithm and I wanted to count the number of multiplications(and divisions) which happens during its execution.I was looking for an easy way other than looking at the high level code and inferring the numbers.Maybe I should use a custom multiply function and insert a counter in it. — Jean
I'm not sure I believe "zero wait memory", even L1 cache on modern CPUs is 4 cycles! But regardless: looks to tricks like building your app in C++ using a custom operator*() implementation. Note that on modern compilers even "multiplication" may not be implemented in an easy to detect way (consider the classic tricks played with the LEA instruction). — Andy Ross

score 5 · Accepted Answer · 2020-11-16T18:10:10

Linux perf_event_open system call with config = PERF_COUNT_HW_INSTRUCTIONS

This Linux system call appears to be a cross architecture wrapper for performance events, including both hardware performance counters from the CPU and software events from the kernel.

Here's an example adapted from the man perf_event_open page:

perf_event_open.c

#define _GNU_SOURCE
#include <asm/unistd.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>

#include <inttypes.h>
#include <sys/types.h>

static long
perf_event_open(struct perf_event_attr *hw_event, pid_t pid,
                int cpu, int group_fd, unsigned long flags)
{
    int ret;

    ret = syscall(__NR_perf_event_open, hw_event, pid, cpu,
                    group_fd, flags);
    return ret;
}

int
main(int argc, char **argv)
{
    struct perf_event_attr pe;
    long long count;
    int fd;

    uint64_t n;
    if (argc > 1) {
        n = strtoll(argv[1], NULL, 0);
    } else {
        n = 10000;
    }

    memset(&pe, 0, sizeof(struct perf_event_attr));
    pe.type = PERF_TYPE_HARDWARE;
    pe.size = sizeof(struct perf_event_attr);
    pe.config = PERF_COUNT_HW_INSTRUCTIONS;
    pe.disabled = 1;
    pe.exclude_kernel = 1;
    // Don't count hypervisor events.
    pe.exclude_hv = 1;

    fd = perf_event_open(&pe, 0, -1, -1, 0);
    if (fd == -1) {
        fprintf(stderr, "Error opening leader %llx\n", pe.config);
        exit(EXIT_FAILURE);
    }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* Loop n times, should be good enough for -O0. */
    __asm__ (
        "1:;\n"
        "sub $1, %[n];\n"
        "jne 1b;\n"
        : [n] "+r" (n)
        :
        :
    );

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &count, sizeof(long long));

    printf("Used %lld instructions\n", count);

    close(fd);
}

Compile and run:

g++ -ggdb3 -O0 -std=c++11 -Wall -Wextra -pedantic -o perf_event_open.out perf_event_open.c
./perf_event_open.out

Output:

Used 20016 instructions

So we see that the result is pretty close to the expected value of 20000: 10k * two instructions per loop in the __asm__ block (sub, jne).

If I vary the argument, even to low values such as 100:

./perf_event_open.out 100

it gives:

Used 216 instructions

maintaining that constant + 16 instructions, so it seems that accuracy is pretty high, those 16 must be just the ioctl setup instructions after our little loop.

Now you might also be interested in:

prevent reordering of the syscalls: Enforcing statement order in C++
prevent the test loop from being optimized out: How to prevent GCC from optimizing out a busy wait loop?

Other events of interest that can be measured by this system call:

cycle counts: How to get the CPU cycle count in x86_64 from C++?

Tested on Ubuntu 20.04 amd64, GCC 9.3.0, Linux kernel 5.4.0, Intel Core i7-7820HQ CPU.

Quick way to count number of instructions executed in a C program

5 Answers

Intel Pin's `instcount`

Quick way to count number of instructions executed in a C program

5 Answers

Intel Pin's instcount

Intel Pin's `instcount`