How to count the number of CPU clock cycles between the start and end of a benchmark in gem5?

I'm interested in all of the following cases:

  • full system userland benchmark. Maybe the m5 guest tool has a way to do it?

  • bare metal benchmark. When gem5 exits it dumps the stats automatically, so the main question is how to skip the cycles for bootloader and go straight to the benchmark itself.

    Is there a way besides modifying the benchmark source with instrumentation instructions? How to write those instrumentation instructions in detail?

  • syscall emulation benchmark. I think gem5 just outputs stats.txt at the end of the run, and then you can just grep system.cpu.numCycles, but I still have to confirm it. I am currently blocked on: How to solve "FATAL: kernel too old" when running gem5 in syscall emulation SE mode?

I want to use this to learn:

  • how CPUs work
  • how to optimize assembly code or compiler settings to run optimally on a given CPU

1 Answer


m5 tool

A good approximation is to run, ideally from a shell script that is the /init program:

m5 resetstats
run-benchmark
m5 dumpstats

Then on host:

grep -E '^system.cpu.numCycles ' m5out/stats.txt

Gives something like:

system.cpu.numCycles                      33942872680                       # number of cpu cycles simulated
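If you only want the number, for example to feed it into a script that compares runs, the grep can be replaced by a small awk helper. This is a minimal sketch: the function name gem5_stat is made up for illustration, and the default path m5out/stats.txt is simply the one used above.

```shell
#!/bin/sh
# Hypothetical helper: print the value column of one stat from a gem5 stats file.
# Usage: gem5_stat STAT_NAME [STATS_FILE]
gem5_stat() {
    stat_name="$1"
    stats_file="${2:-m5out/stats.txt}"
    # In stats.txt, field 1 is the stat name and field 2 is its value.
    # An exact string comparison avoids matching longer names by accident.
    awk -v name="$stat_name" '$1 == name { print $2 }' "$stats_file"
}
```

Usage: gem5_stat system.cpu.numCycles. If the file contains several dumps, one value is printed per dump.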

Note that if you replay from an m5 checkpoint with a different CPU, e.g.:

--restore-with-cpu=HPI --caches

then you need to grep for a different identifier:

grep -E '^system.switch_cpus.numCycles ' m5out/stats.txt

resetstats zeroes out the cumulative stats, and dumpstats dumps what has been collected during the benchmark.
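Because every dump (including the automatic one when gem5 exits) is appended to stats.txt as its own section, a run can leave more than one numCycles line in the file. The sketch below numbers the sections and prints the cycle count found in each. It assumes the sections start with a line containing "Begin Simulation Statistics", which is what the stats.txt files I have seen look like, and a single-CPU system named either system.cpu or system.switch_cpus. The function name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print "<dump index> <numCycles>" for each stats dump.
# Usage: gem5_cycles_per_dump [STATS_FILE]
gem5_cycles_per_dump() {
    stats_file="${1:-m5out/stats.txt}"
    awk '
        # Assumption: each dump section starts with this marker line.
        /Begin Simulation Statistics/ { dump++ }
        # Match both the regular CPU and the CPU switched in after a restore.
        $1 ~ /^system\.(cpu|switch_cpus)\.numCycles$/ { print dump, $2 }
    ' "$stats_file"
}
```

With the resetstats / run-benchmark / dumpstats sequence above, the first line printed should be the one that covers the benchmark.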

This is not perfect, since some time passes between m5 resetstats finishing and the benchmark starting, and again between the benchmark ending and m5 dumpstats taking effect (each m5 call involves an exec syscall). If the benchmark runs long enough, this shouldn't matter.

http://arm.ecs.soton.ac.uk/wp-content/uploads/2016/10/gem5_tutorial.pdf also proposes a few more heuristics:

#!/bin/sh
# Wait for system to calm down
sleep 10
# Take a checkpoint in 100000 ns
m5 checkpoint 100000
# Reset the stats
m5 resetstats
run-benchmark
# Exit the simulation
m5 exit

m5 exit also works, since gem5 dumps stats when it finishes.

Instrumentation instructions

Sometimes it seems inevitable that you have to modify the input source code a bit with such instructions in order to:

  • skip initialization and go directly to steady state
  • evaluate individual main loop runs

You can of course deduce those instructions from the gem5 m5 tool code, but here are some easy-to-reuse one-line copy-pastes for arm and aarch64, e.g. for aarch64:

/* resetstats */
__asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x40 << 16);" : : : "x0", "x1");
/* dumpstats */
__asm__ __volatile__ ("mov x0, #0; mov x1, #0; .inst 0xFF000110 | (0x41 << 16);" : : : "x0", "x1");

The m5 tool uses the same mechanism under the hood, but by adding the instructions directly into the source we avoid the exec syscall, so the measurement is more precise and representative (at the cost of more manual work).

However, to ensure that the compiler does not reorder code around your region of interest (ROI), you might want to use the techniques mentioned at: Enforcing statement order in C++

Address monitoring

Another technique that can be used is to monitor addresses of interest instead of adding magic instructions to the source.

E.g., if you know that a benchmark starts with PC == 0x400, it should be possible to do something when that address is hit.

To find the addresses of interest, you would have to use, for example, readelf, gdb or tracing, and, if running full system on top of Linux, ensure that ASLR is turned off.
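As a rough illustration of the address lookup step, the sketch below pulls the address of one symbol out of nm-style output (address, type, name). The helper name symbol_addr and the binary name ./benchmark are hypothetical. It also assumes a non-PIE executable, where the address in the symbol table is the address seen at runtime. I haven't used this with gem5 itself.

```shell
#!/bin/sh
# Hypothetical helper: read `nm`-style lines ("<hex addr> <type> <name>")
# on stdin and print the address of the exactly matching symbol.
# Usage: nm ./benchmark | symbol_addr main
symbol_addr() {
    # Undefined symbols have no address column (only 2 fields), so skip them.
    awk -v sym="$1" 'NF == 3 && $3 == sym { print "0x" $1 }'
}

# On a Linux guest, ASLR can be turned off at runtime (as root) with:
#   echo 0 > /proc/sys/kernel/randomize_va_space
```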

This technique would be the least intrusive one, but the setup is harder, and to be honest I haven't done it yet. One day, one day.