I'm currently trying to optimize my software for better CPU cache usage. There are some posts on SO which suggest that it's sometimes hard to guess what the CPU cache is doing and why there are some performance drops in certain cases. For example:
- Why does the speed of memcpy() drop dramatically every 4KB?
- Why is my program slow when looping over exactly 8192 elements?
- Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
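As far as I understand, the common thread in those posts is that power-of-two strides make many addresses compete for the same few cache sets. Here is a minimal sketch of that effect, assuming a 32 KiB, 8-way L1 data cache with 64-byte lines and plain modulo set indexing. Those parameters are my assumptions rather than something measured, and the function names are made up for illustration.

```python
# Assumed (hypothetical) L1d geometry: 32 KiB, 8-way, 64-byte lines.
# Real CPUs may differ; check the actual values for your machine.
LINE_SIZE = 64                                   # bytes per cache line
WAYS = 8                                         # lines per set
NUM_SETS = 32 * 1024 // (LINE_SIZE * WAYS)       # 64 sets


def set_index(addr: int) -> int:
    """Set an address maps to in a simple modulo-indexed cache."""
    return (addr // LINE_SIZE) % NUM_SETS


def sets_touched_by_column(n: int, elem_size: int = 4, rows: int = 64) -> int:
    """Count the distinct sets hit while walking `rows` elements down one
    column of a row-major n x n matrix with `elem_size`-byte elements."""
    stride = n * elem_size                       # bytes between consecutive rows
    return len({set_index(r * stride) for r in range(rows)})


if __name__ == "__main__":
    for n in (512, 513):
        touched = sets_touched_by_column(n)
        # Upper bound on how many of the column's lines can be cached at once.
        print(n, touched, touched * WAYS)
```

Under these assumptions the 512-wide column lands in only 2 sets (room for at most 16 of its 64 lines), while the 513-wide column spreads over 8 sets (room for all 64). This is only a model of the indexing, not of a real replacement policy or of the L2/L3 caches.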
So in order to get a clue where the cache misses happen, I can run `perf` to get a count of cache misses and where they occur, as well as `valgrind --tool=cachegrind` to simulate the caches (at least an L1 and a last-level cache).
It's really nice to know where cache misses happen, but I'd like to know why they happen (for example cache thrashing etc.). Is there a way to explicitly pause the program and see what's inside the caches (maybe with the program running under `valgrind` with `vgdb` attached)?