
I know that on a single-core machine, multithreaded programming can increase cache misses, because each thread has its own stack, its own instruction pointer, and so on. So when the threads context-switch, the CPU has to pull a different region of RAM into cache, and this gives us cache misses.

So I'm wondering whether multiple cores can avoid this issue. Say I have a program containing two threads and my machine has two cores. If each thread can be assigned to a different core, does that mean I can avoid the cache-miss issue?


1 Answer


The answer, as ever, is: it depends.

Assuming that separate cores come with separate L1 caches (not guaranteed, but common enough), then yes, there will be fewer cache misses. But it does depend on how much data each thread processes "in one lump", and how much processing is done to it before new data needs to be fetched. If that working set is larger than the per-core caches, both threads get bumped up to (at least on Intel CPUs) the L3 cache, which is shared, and the L3 cache then becomes the bottleneck. If the data overflows even that, then it's back to SDRAM, which is as slow as it gets. And if the data set is larger than the system's RAM, well, that's what the OS's page file is for, and that is very slow indeed.

CPU designers generally take a bet that the cache architectures they choose will satisfy a broad swathe of "typical" applications, and they're pretty successful in that regard. But if you really want that very last few percent of performance, the "cleverness" of the cache engines may start working against you. The cache might guess that your program wants data X next, when actually it asks for data Y: cache miss, big slowdown. Understanding exactly what the cache architecture on a chip will do in any specific circumstance can be very difficult, and even harder to adapt for when writing one's code.

Some caches allow programmers to drop hints - the PowerPC 7400 family does this, and it's very useful. Rather than relying on the cache engine guessing, the program can tell the cache explicitly that, if it can, it'd be well worth beginning to load up data Y. Use that instruction ahead of time, and by the time the program actually gets around to processing the data, it's already in cache. No cache miss. If the programmer knows they can drop better hints than the cache's guesses, they need only include the relevant instruction at the right points in the program.

The Cell processor from IBM (think: Sony PlayStation 3) took this to the extreme. There was no cache at all. Instead, each maths core had 256 KB of on-chip RAM with single-cycle access (so, like an L1 cache), and it was left entirely up to the programmer to load data and code into that RAM from off-chip RAM. It was pretty hard to programme for, but once mastered it was very, very fast.