What I find hard to understand is whether these fencing instructions apply across the entire set of cores (or sockets) or are only in effect for a single core.
A fence issued on one thread takes effect on the execution of a single core. Fences are not only instructions executed by the CPU; they are also a signal to the compiler not to reorder memory accesses around them.
It would really help me if someone could explain how these fences work in a multi-core processor.
They work in pairs: one thread orders all its writes before a release, and the reading thread orders its dependent reads after an acquire. If they are not paired properly, you still get races, because one of the threads can reorder its accesses and the other thread can then observe that reordering.
Note that fences are stronger constructs than ordered atomic writes and reads: a fence orders all memory accesses, while an ordered access only orders accesses around that same memory location. As a result, fences may translate to different CPU instructions than ordered atomics.
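The difference can be sketched like this (function and variable names are made up for illustration): a release *store* only gives ordering with respect to the one atomic it touches, while a release *fence* orders all earlier accesses before *any* later store, so one fence can cover publications through several atomics.

```cpp
#include <atomic>

std::atomic<int> x{0}, y{0};
int a = 0;

// Release store: orders `a = 1` only with respect to x itself;
// a reader must acquire-load x to get the guarantee.
void publish_with_store() {
    a = 1;
    x.store(1, std::memory_order_release);
}

// Release fence: orders all earlier accesses before any later store,
// so the relaxed stores to both x and y publish `a = 1`.
void publish_with_fence() {
    a = 1;
    std::atomic_thread_fence(std::memory_order_release);
    x.store(1, std::memory_order_relaxed);
    y.store(1, std::memory_order_relaxed);
}
```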
How this translates to machine instructions depends on the architecture. x86, for example, provides fairly strong ordering out of the box, so all but one of the fence types translate to no-ops at the CPU level and only need to inhibit reorderings performed by the compiler; the exception is the full sequentially consistent fence, which emits an `mfence` or an equivalent locked instruction.
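You can see the compiler-only half of this with `std::atomic_signal_fence`, which never emits a machine instruction but still stops the compiler from reordering across it; on x86 an acquire or release `std::atomic_thread_fence` compiles down to the same thing, because the hardware already preserves that ordering. A sketch (the function name is made up):

```cpp
#include <atomic>

int counter = 0;

int bump_twice() {
    counter = 1;
    // Compiler-only barrier: no instruction on any target, but the compiler
    // may not merge or move the two assignments across it.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    counter = 2;
    return counter;
}
```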
ARM, on the other hand, has a weaker memory model and needs actual store and load barrier instructions (such as `dmb`) in addition to compiler-level barriers.
How exactly those instructions are implemented at the hardware level depends not only on the architecture but on individual processor families. It generally involves the cache coherency protocol and additional constraints on out-of-order pipelines. See this answer for an example of how it works in current x86 processors.