4 votes

Assume x86 multi-core PC architecture...

Let's say there are 2 cores (capable of executing 2 separate streams of instructions) and that the interface between the CPU and RAM is a single memory bus.

Can 2 instructions (which access some memory) that are scheduled on the 2 different cores truly be simultaneous on such a machine?

I'm not talking about a case where the 2 instructions access the same memory location. Even when the 2 instructions access completely different memory locations (and let's also assume that the contents of these locations are not in any cache), I would think that the single memory bus sitting between the CPU and RAM (which is very common) would cause these 2 instructions to be serialized by the bus arbitration circuitry:

CPU0               CPU1
mov eax,[1000]     mov ebx,[2000]
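
In C terms, the scenario I have in mind is two threads reading two unrelated locations at the same time. A rough sketch (the two variables simply stand in for the addresses 1000 and 2000; compile with -pthread):

/* Minimal sketch of the scenario above using POSIX threads.
 * The variable names and values are illustrative only. */
#include <pthread.h>

static int location_a; /* stands in for address 1000 */
static int location_b; /* stands in for address 2000 */

static void *core0(void *arg) {
    (void)arg;
    int eax = location_a;   /* mov eax,[1000] */
    (void)eax;
    return NULL;
}

static void *core1(void *arg) {
    (void)arg;
    int ebx = location_b;   /* mov ebx,[2000] */
    (void)ebx;
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, core0, NULL);
    pthread_create(&t1, NULL, core1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}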

Is this true? If so, what is the advantage of having multiple cores if the software you will run is multi-threaded but has lots of memory accesses? Wouldn't these instructions all be serialized at the end?

Also, if this is true, what's the point of the LOCK prefix in x86, which is used for making a memory-access instruction atomic?

1
"which access some memory" is the key here. Your title alone has an obvious answer.Paul Draper
You're confusing locks and memory ordering with memory latency. Ordering means you have a single observation time, or ordering point; memory latency is still long, and therefore multiple accesses (from several cores or even a single one) can overlap to save time. Naturally, if they all access the same DRAM you'll have to order them on the bus somehow using the memory controller, but it's still much more efficient than having only a single request pending at any moment. – Leeor

1 Answer

3 votes

You need to check a few concepts of the x86 architecture to answer that:

  • speculative execution (and out-of-order execution)
  • the load/store buffers
  • the MESI cache-coherence protocol
  • store-to-load forwarding
  • memory barriers (see the sketch after this list)
  • NUMA
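
To illustrate how the store buffer and memory barriers interact: on x86, a store followed by a load to a different location can be reordered because the store sits in the store buffer, and a barrier is what rules that out. A minimal C11 sketch of the classic pattern (the names are mine; compile with -pthread):

/* Without the two fences, x86's store buffer allows r0 == 0 && r1 == 0:
 * each core's store can become visible after the other core's load
 * (StoreLoad reordering). The seq_cst fence drains the store buffer
 * (an MFENCE or a LOCK-prefixed instruction on x86). */
#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

static atomic_int x, y;
static int r0, r1;

static void *thread0(void *arg) {
    (void)arg;
    atomic_store_explicit(&x, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    r0 = atomic_load_explicit(&y, memory_order_relaxed);
    return NULL;
}

static void *thread1(void *arg) {
    (void)arg;
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, thread0, NULL);
    pthread_create(&b, NULL, thread1, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("r0=%d r1=%d\n", r0, r1); /* with the fences, at least one is 1 */
    return 0;
}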

Basically, my guess is that your instructions will be executed absolutely in parallel; if they did touch the same location, the result in memory would be that of one thread or the other, and the election would be decided by the MESI hardware.
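
As a concrete illustration of that election (a minimal sketch; the variable and the stored values are mine; compile with -pthread): when two threads store different values to the same location, coherence guarantees the final contents are exactly one store or the other, never a byte-wise mix.

#include <pthread.h>
#include <stdio.h>
#include <stdatomic.h>

static atomic_int shared;

/* Each thread stores its own value into the same location. */
static void *writer(void *arg) {
    atomic_store(&shared, (int)(long)arg);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, writer, (void *)1L);
    pthread_create(&b, NULL, writer, (void *)2L);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("final value: %d\n", atomic_load(&shared)); /* always 1 or 2 */
    return 0;
}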

To extend on the answer: when you have multiple instruction flows and a single data item (http://en.wikipedia.org/wiki/MISD), you need to expect serialization. Note that this can be mitigated if you access different memory addresses, notably on NUMA systems.

Opterons and newer i7s have NUMA hardware, but the OS needs to activate it, and it is not on by default. If you have NUMA, you can take advantage of one bus connecting one core to one memory zone. However, the core must be the owner of that zone, which should hold if the core allocated its zone itself.
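
A minimal sketch of that "allocate your own zone" pattern, assuming Linux's default first-touch placement policy (and assuming the threads stay pinned to cores on different nodes, which I omit here for brevity; the size and names are arbitrary):

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define ZONE_SIZE (64 * 1024 * 1024)

static void *worker(void *arg) {
    (void)arg;
    /* Allocate and first-touch from this thread, not from main(),
     * so under first-touch the pages land on this thread's NUMA node. */
    char *zone = malloc(ZONE_SIZE);
    memset(zone, 0, ZONE_SIZE);   /* the first touch binds the pages */
    /* ... work on node-local memory ... */
    free(zone);
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}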

On all other hardware there will be serialization, but if the memory addresses are different the accesses will not hinder write performance (there is no wait for the end of a write), thanks to the store buffer and intermediate L2 caching. L2 contents are committed to RAM later, and since L2 is per core, the serialization happens in the background and does not hinder the CPU, whose instructions can continue on ahead.
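
You can see the "different addresses do not hinder each other" effect directly by comparing counters that share a cache line with counters that each own one. A rough sketch (the 64-byte line size is an assumption that holds on current x86 parts; compile with -pthread):

/* Each counter gets its own 64-byte cache line, so the two cores'
 * writes do not contend. Remove the _Alignas(64) so both counters
 * share one line and the run gets much slower: the line ping-pongs
 * between the cores (false sharing). */
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

struct padded { _Alignas(64) volatile long n; };
static struct padded counters[2];

static void *bump(void *arg) {
    volatile long *n = &((struct padded *)arg)->n;
    for (long i = 0; i < ITERS; i++)
        (*n)++;
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, bump, &counters[0]);
    pthread_create(&b, NULL, bump, &counters[1]);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("%ld %ld\n", counters[0].n, counters[1].n);
    return 0;
}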

EDIT about the LOCK question: the x86 LOCK prefix is about flushing the load/store buffers so that other cores can obtain visibility of the current value being operated on in the instruction pipeline. This is much closer to the CPU than the RAM-writing problem. LOCK ensures that cores are not working merely on their local view of some variable's contents, because without it the CPU applies any optimization it can while considering only one thread, meaning it will often keep everything in registers and not rely on the cache. It can even go slightly beyond that when you consider load forwarding, or more precisely, store-to-load forwarding.
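
You normally reach LOCK through atomic read-modify-write operations rather than by writing the prefix yourself. A minimal C11 sketch (my names; compile with -pthread): the atomic increment compiles to a LOCK-prefixed instruction on x86 (lock add or lock xadd), while the plain increment is an ordinary load/add/store and loses updates under contention.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

#define ITERS 1000000

static atomic_long atomic_counter;
static long plain_counter;

static void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        atomic_fetch_add(&atomic_counter, 1); /* LOCK-prefixed RMW */
        plain_counter++;                      /* unlocked, racy */
    }
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, work, NULL);
    pthread_create(&b, NULL, work, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* atomic is always 2*ITERS; plain is usually less. */
    printf("atomic=%ld plain=%ld\n", atomic_load(&atomic_counter), plain_counter);
    return 0;
}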