262
votes

Once upon a time, to write x86 assembler, for example, you would have instructions stating "load the EDX register with the value 5", "increment the EDX register", etc.

With modern CPUs that have 4 cores (or even more), at the machine code level does it just look like there are 4 separate CPUs (i.e. are there just 4 distinct "EDX" registers) ? If so, when you say "increment the EDX register", what determines which CPU's EDX register is incremented? Is there a "CPU context" or "thread" concept in x86 assembler now?

How does communication/synchronization between the cores work?

If you were writing an operating system, what mechanism is exposed via hardware to allow you to schedule execution on different cores? Is it some special privileged instruction(s)?

If you were writing an optimizing compiler/bytecode VM for a multicore CPU, what would you need to know specifically about, say, x86 to make it generate code that runs efficiently across all the cores?

What changes have been made to x86 machine code to support multi-core functionality?

10
There's a similar (though not identical) question here: stackoverflow.com/questions/714905/… – Nathan Fellman

10 Answers

167
votes

This isn't a direct answer to the question, but it's an answer to a question that appears in the comments. Essentially, the question is what support the hardware gives to multi-threaded operation.

Nicholas Flynt had it right, at least regarding x86. In a multi-threaded environment (Hyper-Threading, multi-core or multi-processor), the bootstrap thread (usually thread 0 in core 0 in processor 0) starts up fetching code from address 0xFFFFFFF0. All the other threads start up in a special sleep state called Wait-for-SIPI (WFS). As part of its initialization, the primary thread sends a special inter-processor interrupt (IPI) over the APIC called a SIPI (Startup IPI) to each thread that is in WFS. The SIPI contains the address from which that thread should start fetching code.

This mechanism allows each thread to execute code from a different address. All that's needed is software support for each thread to set up its own tables and messaging queues. The OS uses those to do the actual multi-threaded scheduling.

As far as the actual assembly is concerned, as Nicholas wrote, there's no difference between the assembly for a single-threaded and a multi-threaded application. Each logical thread has its own register set, so writing:

mov edx, 0

will only update EDX for the currently running thread. There's no way to modify EDX on another processor using a single assembly instruction. You need some sort of system call to ask the OS to tell another thread to run code that will update its own EDX.

111
votes

Intel x86 minimal runnable baremetal example

Runnable bare metal example with all required boilerplate. All major parts are covered below.

Tested on Ubuntu 15.10 with QEMU 2.3.0, and on a Lenovo ThinkPad T400 as real hardware.

The Intel Manual Volume 3 System Programming Guide - 325384-056US September 2015 covers SMP in chapters 8, 9 and 10.

Table 8-1. "Broadcast INIT-SIPI-SIPI Sequence and Choice of Timeouts" contains an example that basically just works:

MOV ESI, ICR_LOW    ; Load address of ICR low dword into ESI.
MOV EAX, 000C4500H  ; Load ICR encoding for broadcast INIT IPI
                    ; to all APs into EAX.
MOV [ESI], EAX      ; Broadcast INIT IPI to all APs
; 10-millisecond delay loop.
MOV EAX, 000C46XXH  ; Load ICR encoding for broadcast SIPI IP
                    ; to all APs into EAX, where xx is the vector computed in step 10.
MOV [ESI], EAX      ; Broadcast SIPI IPI to all APs
; 200-microsecond delay loop
MOV [ESI], EAX      ; Broadcast second SIPI IPI to all APs
                    ; Waits for the timer interrupt until the timer expires

On that code:

  1. Most operating systems will make most of those operations impossible from ring 3 (user programs).

    So you need to write your own kernel to play freely with it: a userland Linux program will not work.

  2. At first, a single processor runs, called the bootstrap processor (BSP).

    It must wake up the other ones (called Application Processors (AP)) through special interrupts called Inter Processor Interrupts (IPI).

    Those interrupts can be sent by programming the Advanced Programmable Interrupt Controller (APIC) through its Interrupt Command Register (ICR).

    The format of the ICR is documented at: 10.6 "ISSUING INTERPROCESSOR INTERRUPTS"

    The IPI happens as soon as we write to the ICR.

  3. ICR_LOW is defined at 8.4.4 "MP Initialization Example" as:

    ICR_LOW EQU 0FEE00300H
    

    The magic value 0FEE00300H is the memory address of the ICR, as documented at Table 10-1 "Local APIC Register Address Map"

  4. The simplest possible method is used in the example: it sets up the ICR to send broadcast IPIs which are delivered to all other processors except the current one.

    But it is also possible, and recommended by some, to get information about the processors through special data structures set up by the BIOS, like ACPI tables or Intel's MP configuration table, and only wake up the ones you need one by one.

  5. XX in 000C46XXH encodes the address of the first instruction that the processor will execute as:

    CS = XX * 0x100
    IP = 0
    

    Remember that CS multiplies addresses by 0x10, so the actual memory address of the first instruction is:

    XX * 0x1000
    

    So if for example XX == 1, the processor will start at 0x1000.

    We must then ensure that there is 16-bit real mode code to be run at that memory location, e.g. with:

    cld
    mov $init_len, %ecx
    mov $init, %esi
    mov $0x1000, %edi
    rep movsb
    
    .code16
    init:
        xor %ax, %ax
        mov %ax, %ds
        /* Do stuff. */
        hlt
    .equ init_len, . - init
    

    Using a linker script is another possibility.

  6. The delay loops are an annoying part to get working: there is no super simple way to do such sleeps precisely.

    Possible methods include:

    • PIT (used in my example)
    • HPET
    • calibrate the time of a busy loop with the above, and use it instead

    Related: How to display a number on the screen and sleep for one second with DOS x86 assembly?

  7. I think the initial processor needs to be in protected mode for this to work, as we write to address 0FEE00300H, which is too high to reach with 16-bit addressing.

  8. To communicate between processors, we can use a spinlock on the main core, and modify the lock from the second core.

    We should ensure that memory write back is done, e.g. through wbinvd.

Shared state between processors

8.7.1 "State of the Logical Processors" says:

The following features are part of the architectural state of logical processors within Intel 64 or IA-32 processors supporting Intel Hyper-Threading Technology. The features can be subdivided into three groups:

  • Duplicated for each logical processor
  • Shared by logical processors in a physical processor
  • Shared or duplicated, depending on the implementation

The following features are duplicated for each logical processor:

  • General purpose registers (EAX, EBX, ECX, EDX, ESI, EDI, ESP, and EBP)
  • Segment registers (CS, DS, SS, ES, FS, and GS)
  • EFLAGS and EIP registers. Note that the CS and EIP/RIP registers for each logical processor point to the instruction stream for the thread being executed by the logical processor.
  • x87 FPU registers (ST0 through ST7, status word, control word, tag word, data operand pointer, and instruction pointer)
  • MMX registers (MM0 through MM7)
  • XMM registers (XMM0 through XMM7) and the MXCSR register
  • Control registers and system table pointer registers (GDTR, LDTR, IDTR, task register)
  • Debug registers (DR0, DR1, DR2, DR3, DR6, DR7) and the debug control MSRs
  • Machine check global status (IA32_MCG_STATUS) and machine check capability (IA32_MCG_CAP) MSRs
  • Thermal clock modulation and ACPI Power management control MSRs
  • Time stamp counter MSRs
  • Most of the other MSR registers, including the page attribute table (PAT). See the exceptions below.
  • Local APIC registers.
  • Additional general purpose registers (R8-R15), XMM registers (XMM8-XMM15), control register, IA32_EFER on Intel 64 processors.

The following features are shared by logical processors:

  • Memory type range registers (MTRRs)

Whether the following features are shared or duplicated is implementation-specific:

  • IA32_MISC_ENABLE MSR (MSR address 1A0H)
  • Machine check architecture (MCA) MSRs (except for the IA32_MCG_STATUS and IA32_MCG_CAP MSRs)
  • Performance monitoring control and counter MSRs

Cache sharing is discussed at:

Intel hyperthreads have greater cache and pipeline sharing than separate cores: https://superuser.com/questions/133082/hyper-threading-and-dual-core-whats-the-difference/995858#995858

Linux kernel 4.2

The main initialization action seems to be at arch/x86/kernel/smpboot.c.

ARM minimal runnable baremetal example

Here I provide a minimal runnable ARMv8 aarch64 example for QEMU:

.global mystart
mystart:
    /* Reset spinlock. */
    mov x0, #0
    ldr x1, =spinlock
    str x0, [x1]

    /* Read cpu id into x1.
     * TODO: cores beyond 4th?
     * Mnemonic: Main Processor ID Register
     */
    mrs x1, mpidr_el1
    ands x1, x1, 3
    beq cpu0_only
cpu1_only:
    /* Only CPU 1 reaches this point and sets the spinlock. */
    mov x0, 1
    ldr x1, =spinlock
    str x0, [x1]
    /* Ensure that CPU 0 sees the write right now.
     * Optional, but could save some useless CPU 1 loops.
     */
    dmb sy
    /* Wake up CPU 0 if it is sleeping on wfe.
     * Optional, but could save power on a real system.
     */
    sev
cpu1_sleep_forever:
    /* Hint CPU 1 to enter low power mode.
     * Optional, but could save power on a real system.
     */
    wfe
    b cpu1_sleep_forever
cpu0_only:
    /* Only CPU 0 reaches this point. */

    /* Wake up CPU 1 from initial sleep!
     * See: https://github.com/cirosantilli/linux-kernel-module-cheat#psci
     */
    /* PSCI function identifier: CPU_ON. */
    ldr w0, =0xc4000003
    /* Argument 1: target_cpu */
    mov x1, 1
    /* Argument 2: entry_point_address */
    ldr x2, =cpu1_only
    /* Argument 3: context_id */
    mov x3, 0
    /* Unused hvc args: the Linux kernel zeroes them,
     * but I don't think it is required.
     */
    hvc 0

spinlock_start:
    ldr x0, spinlock
    /* Hint CPU 0 to enter low power mode. */
    wfe
    cbz x0, spinlock_start

    /* Semihost exit. */
    mov x1, 0x26
    movk x1, 2, lsl 16
    str x1, [sp, 0]
    mov x0, 0
    str x0, [sp, 8]
    mov x1, sp
    mov w0, 0x18
    hlt 0xf000

spinlock:
    .skip 8

GitHub upstream.

Assemble and run:

aarch64-linux-gnu-gcc \
  -mcpu=cortex-a57 \
  -nostdlib \
  -nostartfiles \
  -Wl,--section-start=.text=0x40000000 \
  -Wl,-N \
  -o aarch64.elf \
  -T link.ld \
  aarch64.S \
;
qemu-system-aarch64 \
  -machine virt \
  -cpu cortex-a57 \
  -d in_asm \
  -kernel aarch64.elf \
  -nographic \
  -semihosting \
  -smp 2 \
;

In this example, we put CPU 0 in a spinlock loop, and it only exits after CPU 1 releases the spinlock.

After the spinlock, CPU 0 then does a semihost exit call which makes QEMU quit.

If you start QEMU with just one CPU with -smp 1, then the simulation just hangs forever on the spinlock.

CPU 1 is woken up with the PSCI interface, more details at: ARM: Start/Wakeup/Bringup the other CPU cores/APs and pass execution start address?

The upstream version also has a few tweaks to make it work on gem5, so you can experiment with performance characteristics as well.

I haven't tested it on real hardware, so I'm not sure how portable this is. The following Raspberry Pi bibliography might be of interest:

This document provides some guidance on using ARM synchronization primitives which you can then use to do fun things with multiple cores: http://infocenter.arm.com/help/topic/com.arm.doc.dht0008a/DHT0008A_arm_synchronization_primitives.pdf

Tested on Ubuntu 18.10, GCC 8.2.0, Binutils 2.31.1, QEMU 2.12.0.

Next steps for more convenient programmability

The previous examples wake up a secondary CPU and do basic memory synchronization with dedicated instructions, which is a good start.

But to make multicore systems easy to program, e.g. like POSIX pthreads, you would also need to go into the following more involved topics:

  • set up interrupts and run a timer that periodically decides which thread will run now. This is known as preemptive multithreading.

    Such a system also needs to save and restore thread registers as they are started and stopped.

    It is also possible to have non-preemptive multitasking systems, but those might require you to modify your code so that every thread yields (e.g. with a pthread_yield implementation), and it becomes harder to balance workloads.

    Here are some simplistic bare metal timer examples:

  • deal with memory conflicts. Notably, each thread will need a unique stack if you want to code in C or other high level languages.

    You could just limit threads to have a fixed maximum stack size, but the nicer way to deal with this is with paging which allows for efficient "unlimited size" stacks.

    Here is a naive aarch64 baremetal example that would blow up if the stack grows too deep.

Those are some good reasons to use the Linux kernel or some other operating system :-)

Userland memory synchronization primitives

Although thread start / stop / management is generally beyond userland scope, you can however use assembly instructions from userland threads to synchronize memory accesses without potentially more expensive system calls.

You should of course prefer using libraries that portably wrap these low level primitives. The C++ standard itself has made great advances on the <mutex> and <atomic> headers, and in particular with std::memory_order. I'm not sure if it covers all possible memory semantics achievable, but it just might.

The more subtle semantics are particularly relevant in the context of lock free data structures, which can offer performance benefits in certain cases. To implement those, you will likely have to learn a bit about the different types of memory barriers: https://preshing.com/20120710/memory-barriers-are-like-source-control-operations/

Boost for example has some lock free container implementations at: https://www.boost.org/doc/libs/1_63_0/doc/html/lockfree.html

Such userland instructions also appear to be used to implement the Linux futex system call, which is one of the main synchronization primitives in Linux. man futex 4.15 reads:

The futex() system call provides a method for waiting until a certain condition becomes true. It is typically used as a blocking construct in the context of shared-memory synchronization. When using futexes, the majority of the synchronization operations are performed in user space. A user-space program employs the futex() system call only when it is likely that the program has to block for a longer time until the condition becomes true. Other futex() operations can be used to wake any processes or threads waiting for a particular condition.

The syscall name itself means "fast userspace mutex".

Here is a minimal useless C++ x86_64 / aarch64 example with inline assembly that illustrates basic usage of such instructions mostly for fun:

main.cpp

#include <atomic>
#include <cassert>
#include <iostream>
#include <thread>
#include <vector>

std::atomic_ulong my_atomic_ulong(0);
unsigned long my_non_atomic_ulong = 0;
#if defined(__x86_64__) || defined(__aarch64__)
unsigned long my_arch_atomic_ulong = 0;
unsigned long my_arch_non_atomic_ulong = 0;
#endif
size_t niters;

void threadMain() {
    for (size_t i = 0; i < niters; ++i) {
        my_atomic_ulong++;
        my_non_atomic_ulong++;
#if defined(__x86_64__)
        __asm__ __volatile__ (
            "incq %0;"
            : "+m" (my_arch_non_atomic_ulong)
            :
            :
        );
        // https://github.com/cirosantilli/linux-kernel-module-cheat#x86-lock-prefix
        __asm__ __volatile__ (
            "lock;"
            "incq %0;"
            : "+m" (my_arch_atomic_ulong)
            :
            :
        );
#elif defined(__aarch64__)
        __asm__ __volatile__ (
            "add %0, %0, 1;"
            : "+r" (my_arch_non_atomic_ulong)
            :
            :
        );
        // https://github.com/cirosantilli/linux-kernel-module-cheat#arm-lse
        __asm__ __volatile__ (
            "ldadd %[inc], xzr, [%[addr]];"
            : "=m" (my_arch_atomic_ulong)
            : [inc] "r" (1),
              [addr] "r" (&my_arch_atomic_ulong)
            :
        );
#endif
    }
}

int main(int argc, char **argv) {
    size_t nthreads;
    if (argc > 1) {
        nthreads = std::stoull(argv[1], NULL, 0);
    } else {
        nthreads = 2;
    }
    if (argc > 2) {
        niters = std::stoull(argv[2], NULL, 0);
    } else {
        niters = 10000;
    }
    std::vector<std::thread> threads(nthreads);
    for (size_t i = 0; i < nthreads; ++i)
        threads[i] = std::thread(threadMain);
    for (size_t i = 0; i < nthreads; ++i)
        threads[i].join();
    assert(my_atomic_ulong.load() == nthreads * niters);
    // We can also use the atomics directly through `operator T` conversion.
    assert(my_atomic_ulong == my_atomic_ulong.load());
    std::cout << "my_non_atomic_ulong " << my_non_atomic_ulong << std::endl;
#if defined(__x86_64__) || defined(__aarch64__)
    assert(my_arch_atomic_ulong == nthreads * niters);
    std::cout << "my_arch_non_atomic_ulong " << my_arch_non_atomic_ulong << std::endl;
#endif
}

GitHub upstream.

Possible output:

my_non_atomic_ulong 15264
my_arch_non_atomic_ulong 15267

From this we see that the x86 LOCK prefix / aarch64 LDADD instruction made the addition atomic: without it we have race conditions on many of the adds, and the total count at the end is less than the expected 20000.


Tested on Ubuntu 19.04 amd64, and with QEMU aarch64 user mode.

45
votes

As I understand it, each "core" is a complete processor, with its own register set. Basically, the BIOS starts you off with one core running, and then the operating system can "start" other cores by initializing them and pointing them at the code to run, etc.

Synchronization is done by the OS. Generally, each processor is running a different process for the OS, so the multi-threading functionality of the operating system is in charge of deciding which process gets to touch which memory, and what to do in the case of a memory collision.

39
votes

The Unofficial SMP FAQ


Once upon a time, to write x86 assembler, for example, you would have instructions stating "load the EDX register with the value 5", "increment the EDX register", etc. With modern CPUs that have 4 cores (or even more), at the machine code level does it just look like there are 4 separate CPUs (i.e. are there just 4 distinct "EDX" registers) ?

Exactly. There are 4 sets of registers, including 4 separate instruction pointers.

If so, when you say "increment the EDX register", what determines which CPU's EDX register is incremented?

The CPU that executed that instruction, naturally. Think of it as 4 entirely different microprocessors that are simply sharing the same memory.

Is there a "CPU context" or "thread" concept in x86 assembler now?

No. The assembler just translates instructions like it always did. No changes there.

How does communication/synchronization between the cores work?

Since they share the same memory, it's mostly a matter of program logic. Although there now is an inter-processor interrupt mechanism, it's not necessary and was not originally present in the first dual-CPU x86 systems.

If you were writing an operating system, what mechanism is exposed via hardware to allow you to schedule execution on different cores?

The scheduler actually doesn't change, except that it is slightly more careful about critical sections and the types of locks used. Before SMP, kernel code would eventually call the scheduler, which would look at the run queue and pick a process to run as the next thread. (Processes to the kernel look a lot like threads.) The SMP kernel runs the exact same code, one thread at a time; it's just that now critical section locking needs to be SMP-safe, so that two cores can't accidentally pick the same PID.

Is it some special privileged instruction(s)?

No. The cores are just all running in the same memory with the same old instructions.

If you were writing an optimizing compiler/bytecode VM for a multicore CPU, what would you need to know specifically about, say, x86 to make it generate code that runs efficiently across all the cores?

You run the same code as before. It's the Unix or Windows kernel that needed to change.

You could summarize my question as "What changes have been made to x86 machine code to support multi-core functionality?"

Nothing was necessary. The first SMP systems used the exact same instruction set as uniprocessors. Now, there has been a great deal of x86 architecture evolution and zillions of new instructions to make things go faster, but none were necessary for SMP.

For more information, see the Intel Multiprocessor Specification.


Update:
1. For backwards compatibility, only the first core starts up at reset, and a few driver-type things need to be done to fire up the remaining ones.
2. They also share all the peripherals, naturally.

Comment (n1n2): How is a program written to run on more than one core for more performance? – Pthreads.
10
votes

If you were writing an optimizing compiler/bytecode VM for a multicore CPU, what would you need to know specifically about, say, x86 to make it generate code that runs efficiently across all the cores?

As someone who writes optimizing compiler/bytecode VMs I may be able to help you here.

You do not need to know anything specifically about x86 to make it generate code that runs efficiently across all the cores.

However, you may need to know about cmpxchg and friends in order to write code that runs correctly across all the cores. Multicore programming requires the use of synchronisation and communication between threads of execution.

You may need to know something about x86 to make it generate code that runs efficiently on x86 in general.

There are other things it would be useful for you to learn:

You should learn about the facilities the OS (Linux or Windows or OSX) provides to allow you to run multiple threads. You should learn about parallelization APIs such as OpenMP and Threading Building Blocks, or OSX 10.6 "Snow Leopard"'s forthcoming "Grand Central".

You should consider if your compiler should be auto-parallelising, or if the author of the applications compiled by your compiler needs to add special syntax or API calls into his program to take advantage of the multiple cores.

9
votes

Each core executes from a different memory area. Your operating system will point a core at your program and the core will execute your program. Your program will not be aware that there is more than one core, or which core it is executing on.

There are also no additional instructions available only to the Operating System. These cores are identical to single-core chips. Each core runs a part of the Operating System that handles communication through common memory areas used for information interchange, in order to find the next memory area to execute.

This is a simplification, but it gives you the basic idea of how it is done. Embedded.com has lots of information about multicores and multiprocessors... This topic gets complicated very quickly!

5
votes

The assembly code will translate into machine code that will be executed on one core. If you want it to be multithreaded, you will have to use operating system primitives to start this code on different processors several times, or to start different pieces of code on different cores; each core will execute a separate thread. Each thread will only see the one core it is currently executing on.

3
votes

It's not done in machine instructions at all; the cores pretend to be distinct CPUs and don't have any special capabilities for talking to one another. There are two ways they communicate:

  • they share the physical address space. The hardware handles cache coherency, so one CPU writes to a memory address which another reads.

  • they share an APIC (programmable interrupt controller). This is memory mapped into the physical address space, and can be used by one processor to control the others, turn them on or off, send interrupts, etc.

http://www.cheesecake.org/sac/smp.html is a good reference with a silly url.

1
votes

The main difference between a single- and a multi-threaded application is that the former has one stack and the latter has one for each thread. Code is generated somewhat differently since the compiler will assume that the data and stack segment registers (ds and ss) are not equal. This means that indirection through the ebp and esp registers that default to the ss register won't also default to ds (because ds!=ss). Conversely, indirection through the other registers which default to ds won't default to ss.

The threads share everything else, including data and code areas. They also share lib routines, so make sure that those are thread-safe. A procedure that sorts an area in RAM can be multi-threaded to speed things up. The threads will then be accessing, comparing and ordering data in the same physical memory area and executing the same code, but using different local variables to control their respective part of the sort. This of course is because the threads have different stacks where the local variables are contained.

This type of programming requires careful tuning of the code so that inter-core data collisions (in caches and RAM) are reduced, which in turn results in code that is faster with two or more threads than it is with just one. Of course, untuned code will often be faster with one processor than with two or more.

Debugging is more challenging because the standard "int 3" breakpoint will not be applicable, since you want to interrupt a specific thread and not all of them. Debug register breakpoints do not solve this problem either, unless you can set them on the specific processor executing the specific thread you want to interrupt.

Other multi-threaded code may involve different threads running in different parts of the program. This type of programming does not require the same kind of tuning and is therefore much easier to learn.

0
votes

What has been added on every multiprocessing-capable architecture compared to the single-processor variants that came before them are instructions to synchronize between cores. Also, you have instructions to deal with cache coherency, flushing buffers, and similar low-level operations an OS has to deal with. In the case of simultaneous multithreaded architectures like IBM POWER6, IBM Cell, Sun Niagara, and Intel "Hyperthreading", you also tend to see new instructions to prioritize between threads (like setting priorities and explicitly yielding the processor when there is nothing to do).

But the basic single-thread semantics are the same; you just add extra facilities to handle synchronization and communication with other cores.