Interesting idea, yeah that should probably get the cache line holding your struct into a state in L3 cache where core#2 can get an L3 hit directly, instead of having to wait for a MESI read request while the line is still in Modified state in the L1d of the core that wrote it.
Or if ProcessD is running on the other logical core of the same physical core as ProcessB, data will be fetched into the right L1d. If it spends most of its time asleep (and wakes up infrequently), ProcessB will still usually have the whole CPU to itself, running in single-thread mode without partitioning the ROB and store buffer.
Instead of having the dummy-access thread spinning on usleep(10), you could have it wait on a condition variable or a semaphore that ProcessC pokes after writing glbXYZ.
With a counting semaphore (like POSIX C semaphores' sem_wait / sem_post), the thread that writes glbXYZ can increment the semaphore, triggering the OS to wake up ProcessD which is blocked in sem_wait. If for some reason ProcessD misses its turn to wake up, it will do 2 iterations before it blocks again, but that's fine. (Hmm, so actually we don't need a counting semaphore, but I think we do want OS-assisted sleep/wake, and this is an easy way to get it, unless we need to avoid the overhead of a system call in ProcessC after writing the struct.) Or a raise() system call in ProcessC could send a signal to trigger wakeup of ProcessD.
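A hedged sketch of the writer side (the names xyz_sem and after_writing_glbXYZ are made up, not from your code):

#include <semaphore.h>

extern sem_t xyz_sem;      // shared between ProcessC and ProcessD, e.g. a named semaphore
                           // from sem_open, or sem_init with pshared=1 in shared memory

void after_writing_glbXYZ(void) {
    // ... ProcessC has just finished writing glbXYZ (inside its SeqLock or whatever) ...
    sem_post(&xyz_sem);    // one sem_post lets ProcessD run one prefetch iteration
}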
With Spectre+Meltdown mitigation, any system call, even an efficient one like Linux futex, is fairly expensive for the thread making it. This cost isn't part of the critical path that you're trying to shorten, though, and it's still much less than the 10 usec sleep you were thinking of between fetches.
void ProcessD(void) {
    while(1){
        sem_wait(something);                 // allows one iteration to run per sem_post
        __builtin_prefetch (&glbXYZ, 0, 1);  // PREFETCHT2 into L2 and L3 cache
    }
}
(According to Intel's optimization manual section 7.3.2, PREFETCHT2 on current CPUs is identical to PREFETCHT1, and fetches into L2 cache (and L3 along the way). I didn't check AMD. See also: What level of the cache does PREFETCHT2 fetch into?)
I haven't tested that PREFETCHT2 will actually be useful here on Intel or AMD CPUs. You might want to use a dummy volatile access like *(volatile char*)&glbXYZ; or *(volatile int*)&glbXYZ.field1. Especially if you have ProcessD running on the same physical core as ProcessB.
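A sketch of what that dummy access could look like in ProcessD's loop (reusing the assumed xyz_sem semaphore from above; this is a hypothetical variant, not something I've tested):

void ProcessD_demand_load(void) {
    while (1) {
        sem_wait(&xyz_sem);                 // block until ProcessC has written glbXYZ
        (void)*(volatile char *)&glbXYZ;    // dummy volatile load: an actual demand load,
                                            // so the line lands in this core's L1d/L2 (and L3)
    }
}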
If prefetcht2 works, you could maybe do that in the thread that writes bDOIT (ProcessA), so it could trigger the migration of the line to L3 right before ProcessB will need it.
If you're finding that the line gets evicted before use, maybe you do want a thread spinning on fetching that cache line.
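A hedged sketch of what that could look like in ProcessA (assuming the atomic bDOIT and glbXYZ declarations from the code block further down; ProcessA_set_flag is a made-up name):

void ProcessA_set_flag(void) {
    __builtin_prefetch(&glbXYZ, 0, 1);    // prefetcht2: pull the struct's line toward L2/L3
    atomic_store_explicit(&bDOIT, true, memory_order_release);   // then publish the flag for ProcessB
}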
On future Intel CPUs, there's a cldemote instruction (_cldemote(const void*)) which you could use after writing to trigger migration of the dirty cache line to L3. It runs as a NOP on CPUs that don't support it, but it's only slated for Tremont (Atom) so far. (Along with umonitor / umwait to wake up when another core writes in a monitored range from user-space, which would probably also be super-useful for low-latency inter-core stuff.)
Since ProcessA doesn't write the struct, you should probably make sure bDOIT is in a different cache line than the struct. You might put alignas(64) on the first member of XYZ so the struct starts at the start of a cache line. alignas(64) atomic<int> bDOIT; would make sure it was also at the start of a line, so they can't share a cache line. Or make it an alignas(64) atomic<bool> or atomic_flag.
Also see Understanding std::hardware_destructive_interference_size and std::hardware_constructive_interference_size [1]: normally 128 is what you want to avoid false sharing because of adjacent-line prefetchers, but it's actually not a bad thing if ProcessB triggers the L2 adjacent-line prefetcher on core#2 to speculatively pull glbXYZ into its L2 cache when it's spinning on bDOIT. So you might want to group those into a 128-byte aligned struct if you're on an Intel CPU.
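A hedged sketch of that grouping (the struct and variable names are made up, and it assumes the XYZ typedef from the code block below):

#include <stdalign.h>
#include <stdatomic.h>

struct FlagAndData {
    alignas(64) atomic_bool flag;    // the bDOIT flag: first 64-byte line of the pair
    alignas(64) XYZ         data;    // the struct: adjacent line, so Intel's L2 adjacent-line
                                     // prefetcher may pull it in while ProcessB spins on flag
};
alignas(128) struct FlagAndData shared_pair;   // keep both lines inside one 128-byte-aligned pair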
And/or you might even use a software prefetch if bDOIT is false, in ProcessB. A prefetch won't block waiting for the data, but if the read request arrives in the middle of ProcessC writing glbXYZ then it will make that take longer. So maybe only SW prefetch every 16th or 64th time bDOIT is false?
And don't forget to use _mm_pause() in your spin loop, to avoid a memory-order mis-speculation pipeline nuke when the branch you're spinning on goes the other way. (Normally this is a loop-exit branch in a spin-wait loop, but that's irrelevant. Your branching logic is equivalent to an outer infinite loop containing a spin-wait loop and then some work, even though that's not how you've written it.)
Or possibly use lock cmpxchg instead of a pure load to read the old value. Full barriers already block speculative loads after the barrier, so they prevent mis-speculation. (You can do this in C11 with atomic_compare_exchange_weak with expected = desired. It takes expected by reference, and updates it if the compare fails.) But hammering on the cache line with lock cmpxchg is probably not helpful to ProcessA being able to commit its store to L1d quickly.
Check the machine_clears.memory_ordering perf counter to see if this is happening without _mm_pause. If it is, then try _mm_pause first, and then maybe try using atomic_compare_exchange_weak as a load. Or atomic_fetch_add(&bDOIT, 0), because lock xadd would be equivalent.
// GNU C11.  The typedef in your question looks like C, redundant in C++, so I assumed C.
#include <immintrin.h>
#include <stdatomic.h>
#include <stdalign.h>

alignas(64) atomic_bool bDOIT;

typedef struct { int a,b,c,d;       // 16 bytes
                 int e,f,g,h;       // another 16
} XYZ;
alignas(64) XYZ glbXYZ;

extern void doSomething(XYZ);

// just one object (of arbitrary type) that might be modified
// maybe cheaper than a "memory" clobber (compile-time memory barrier)
#define MAYBE_MODIFIED(x) asm volatile("": "+g"(x))

// suggested ProcessB
void ProcessB(void) {
    int prefetch_counter = 32;        // local that doesn't escape
    while(1){
        if (atomic_load_explicit(&bDOIT, memory_order_acquire)){
            MAYBE_MODIFIED(glbXYZ);
            XYZ localxyz = glbXYZ;    // or maybe a seqlock_read
         // MAYBE_MODIFIED(glbXYZ);   // worse code from clang, but still good with gcc, unlike a "memory" clobber which can make gcc store localxyz separately from writing it to the stack as a function arg
         // asm("":::"memory");       // make sure it finishes reading glbXYZ instead of optimizing away the copy and doing it during doSomething
            // localxyz hasn't escaped the function, so it shouldn't be spilled because of the memory barrier
            // but if it's too big to be passed in RDI+RSI, code-gen is in practice worse
            doSomething(localxyz);
        } else {
            if (0 == --prefetch_counter) {
                // not too often: don't want to slow down writes
                __builtin_prefetch(&glbXYZ, 0, 3);  // PREFETCHT0 into L1d cache
                prefetch_counter = 32;
            }
            _mm_pause();       // avoids memory order mis-speculation on bDOIT
            // probably worth it for latency and throughput
            // even though it pauses for ~100 cycles on Skylake and newer, up from ~5 on earlier Intel.
        }
    }
}
This compiles to pretty nice asm on Godbolt. If bDOIT stays true, it's a tight loop with no overhead around the call. clang7.0 even uses SSE loads/stores to copy the struct to the stack as a function arg 16 bytes at a time.
Obviously the question is a mess of undefined behaviour which you should fix with _Atomic (C11) or std::atomic (C++11) with memory_order_relaxed. Or mo_release / mo_acquire. You don't have any memory barrier in the function that writes bDOIT, so the compiler could sink that store out of the loop. Making it atomic with memory_order_relaxed has literally zero downside for the quality of the asm.
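A hedged sketch of the writer side with that fix (ProcessA_write_flag is a made-up name):

void ProcessA_write_flag(void) {
    // An atomic store can't be optimized away or sunk out of a loop the way a plain
    // store to a non-atomic global could be; in practice compilers don't optimize atomics.
    atomic_store_explicit(&bDOIT, true, memory_order_release);
    // memory_order_relaxed is also fine here if glbXYZ is protected by its own SeqLock
}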
Presumably you're using a SeqLock or something to protect glbXYZ from tearing. Yes, asm("":::"memory") should make that work by forcing the compiler to assume it's been modified asynchronously. The "g"(glbXYZ) input to the asm statement is useless, though: it's global, so the "memory" barrier already applies to it (because the asm statement could already reference it). If you wanted to tell the compiler that only it could have changed, use asm volatile("" : "+g"(glbXYZ)); without a "memory" clobber.
Or in C (not C++), just make it volatile and do struct assignment, letting the compiler pick how to copy it, without using barriers. In C++, foo x = y; fails for volatile foo y; where foo is an aggregate type like a struct (volatile struct = struct not possible, why?). This is annoying when you want to use volatile to tell the compiler that data may change asynchronously as part of implementing a SeqLock in C++, but you still want to let the compiler copy it as efficiently as possible in arbitrary order, rather than one narrow member at a time.
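A minimal sketch of the C version of that idea (glbXYZ_v and ProcessB_copy are made-up names for illustration):

volatile XYZ glbXYZ_v;              // volatile: may change asynchronously

void ProcessB_copy(void) {
    XYZ localxyz = glbXYZ_v;        // legal in C: whole-struct copy from a volatile object;
                                    // the compiler picks how to do it (wide loads, any order)
    doSomething(localxyz);
}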
Footnote 1: C++17 specifies std::hardware_destructive_interference_size as an alternative to hard-coding 64 or making your own CLSIZE constant, but gcc and clang don't implement it yet because it becomes part of the ABI if used in an alignas() in a struct, and thus can't actually change depending on the actual L1d line size.