2
votes

I wrote something to measure how long my code takes to run, and to print it out. The way I have it now supports nesting of these measurements.

The thing is that in the process of getting the time interval, converting it into a number, getting time format, and then converting everything into a string, and then print it out takes a while (2-3 milliseconds), I/O especially seems expensive. I want the clock to "skip over" this process in a sense, since the stuff I'm measuring would be in the microseconds. (and I think it'd be a good feature regardless, if I have other things I want to skip)

std::chrono::high_resolution_clock clock;
std::chrono::time_point<std::chrono::steady_clock> first, second;

first = clock.now();

std::chrono::time_point<std::chrono::high_resolution_clock> paused_tp = clock.now();
std::cout << "Printing things \n";
clock.setTime(paused_tp); // Something like this is what I want

second = clock.now();

The idea is to make sure that first and second have minimal differences, ideally identical.

From what I see, the high_resolution_clock class (and all the other chrono clocks), keep their time_point private, and you can only access it from clock.now()

I know there might be benchmarking libraries out there that do this, but I'd like to figure out how to do it myself (if only for the sake of knowing how to do it). Any information on how other libraries do it or insights on how chrono works would be appreciated as well. I might be misunderstanding how chrono internally keeps track.

(Is std::chrono::high_resolution_clock even accurate enough for something like this?)

(While I'm here any resources on making C++ programs more efficient would be great)

Edit: I actually do printing after the section of code I'm trying to time, the problem only arises in, say, when I want to time the entire program, as well as individual functions. Then the printing of the function's time would cause delay in overall program time.

Edit 2: I figured I should have more of an example of what I'm doing. I have a class that handles everything, let's say it's called tracker, it takes care of all that clock nonsense.

tracker loop = TRACK(
    for(int i = 0; i != 100; ++i){
        tracker b = TRACK(function_call());
        b.display();
    }
)
loop.display();

The macro is optional, it just a quick thing that makes it less cluttered and allows me to display the name of the function being called.

Explicitly the macro expands to

tracker loop = "for(int i = 0; i != 100; ++i){ tracker b = TRACK(function_call()); b.display(); }"
loop.start()
for(int i = 0; i != 100; ++i){
    tracker b = "function_call()"
    b.start();
    function_call();
    b.end();
    b.display();
}
loop.end();
loop.display();

In most situations the printing isn't an issue, it only keeps track of what's between start() and end(), but here the b.display() ends up interfering with the tracker loop.

A goal of mine with this was for the tracker to be as non-intrusive as possible, so I'd like most/all of it to be handled in the tracker class. But then I run into the problem of b.display() being a method of a different instance than the tracker loop. I've tried a few things with the static keyword but ran into a few issues with that (still trying a little). I might've coded myself into a dead end here, but there are still a lot of things left to try.

2
smth like: ` std::chrono::time_point<std::chrono::steady_clock> first2 = clock.now(); std::cout << "printing thinkgs"; std::chrono::time_point<std::chrono::steady_clock> second2 = clock.now(); first += second2 - first2; `KamilCuk

2 Answers

1
votes

Just time the two intervals separately and add them, i.e. save 4 total timestamps. For nested intervals, you might just save timestamps into an array and sort everything out at the end. (Or inside an outer loop before timestamps get overwritten). Storing to an array is quite cheap.


Or better: defer printing until later.

If the timed interval is only milliseconds, just save what you were going to print and do it outside the timed interval.

If you have nested timed intervals, at least sink the printing out of the inner-most intervals to minimize the amount of stop/restart you have to do.

If you're manually instrumenting your code all over the place, maybe look at profiling tools like flamegraph, especially if your timed intervals break down on function boundaries. linux perf: how to interpret and find hotspots.


Not only does I/O take time itself, it will make later code run slower for a few hundreds or thousands of cycles. Making a system call touches a lot of code, so when you return to user-space it's likely that you'll get instruction-cache and data-cache misses. System calls that modify your page tables will also result in TLB misses.

(See "The (Real) Costs of System Calls" section in the FlexSC paper (Soares, Stumm), timed on an i7 Nehalem running Linux. (First-gen i7 from ~2008/9). The paper proposes a batched system-call mechanism for high-throughput web-servers and similar, but their baseline results for plain Linux are interesting and relevant outside of that.)

On a modern Intel CPU with Meltdown mitigation enabled, you'll usually get TLB misses. With Spectre mitigation enabled on recent x86, branch-prediction history will probably be wiped out, depending on the mitigation strategy. (Intel added a way for the kernel to request that higher-privileged branches after this point won't be affected by prediction history for lower-privileged branches. On current CPUs, I think this just flushes the branch-prediction cache.)

You can avoid the system-call overhead by letting iostream buffer for you. It's still significant work formatting and copying data around, but much cheaper than writing to a terminal. Redirecting your program's stdout to a file will make cout full-buffered by default, instead of line-buffered. i.e. run it like this:

 ./my_program > time_log.txt

The final output will match what you would have got on the terminal, but (as long as you don't do anything silly like using std::endl to force a flush) it will just be buffered up. The default buffer size is probably something like 4kiB. Use strace ./my_program or a similar tool to trace system calls, and make sure you're getting one big write() at the end, instead of lots of small write()s.

It would be nice to avoid buffered I/O inside (outer) timed regions, but it's very important to avoid system calls in places your "real" (non-instrumented) code wouldn't have them, if you're timing down to nanoseconds. And this is true even before timed intervals, not just inside.

cout << foo if it doesn't make a system call isn't "special" in terms of slowing down later code.

0
votes

To overcome the overhead, the time lapse printing can be done by another thread. The main thread saves a start and end time into shared global variables, and notifies the condition variable the print thread is waiting on.

#include<iostream>
#include<thread>
#include<chrono>
#include<mutex>
#include<condition_variable>
#include<atomic>

std::condition_variable cv;
std::mutex mu;
std::atomic<bool> running {true};
std::atomic<bool> printnow {false};

// shared but non-atomic: protected by the mutex and condition variable.
// possible bug: the main thread can write `now` before print_thread wakes up and reads it
std::chrono::high_resolution_clock::time_point start;
std::chrono::high_resolution_clock::time_point now;


void print_thread() {
    std::thread([]() {
        while (running) {
            std::unique_lock<std::mutex> lock(mu);
            cv.wait(lock, []() { return !running || printnow; });
            if (!running) return;
            std::chrono::milliseconds lapse_ms = std::chrono::duration_cast<std::chrono::milliseconds>(now - start);
            printnow = false;
            std::cout << " lapse time " << lapse_ms.count() << "  ms\n";
        }
    }).detach();
}

void print_lapse(std::chrono::high_resolution_clock::time_point start1, std::chrono::high_resolution_clock::time_point now1) {
    start = start1;
    now = now1;
    printnow = true;
    cv.notify_one();
}

int main()
{
    //launch thread
    print_thread(); 

    // laspe1
    std::chrono::high_resolution_clock::time_point start1 = std::chrono::high_resolution_clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(200));
    std::chrono::high_resolution_clock::time_point now1 = std::chrono::high_resolution_clock::now();
    print_lapse(start1,now1);

    // laspe2
    start1 = std::chrono::high_resolution_clock::now();
    std::this_thread::sleep_for(std::chrono::milliseconds(500));
    now1 = std::chrono::high_resolution_clock::now();
    print_lapse(start1, now1);

    //winding up
    std::this_thread::sleep_for(std::chrono::milliseconds(300));
    running = false;
    cv.notify_one();
}