0
votes

Hi, I'm just a newbie to assembly programming. I'm confused about how the CPU retrieves a multibyte value (e.g. 32 bits on a 32-bit machine) from memory. Let's say we have an integer i which occupies 4 bytes in memory (starting at address 0x100). In IA32 assembly we would write something like:

movl 8(%esp), %eax

where esp is the current stack pointer and 8 is the offset from the stack pointer to variable i. When this IA32 instruction executes, does the CPU just retrieve the byte at 0x100? What about the rest of the bytes at 0x101, 0x102, 0x103? How does the CPU retrieve all 32 bits at once?

Edit: new questions. I think I was fundamentally wrong in my understanding of word size. But I'm still confused: how does a 32-bit machine retrieve a long integer which is 8 bytes (64 bits)? Maybe using movq, but then what about accessing an object which is 256 bytes? Does the CPU just issue movq 4 times? How does the CPU know in advance how many times it needs to issue a mov instruction to retrieve a large object?

2
The unit of transfer between the CPU and memory isn't a "byte", it's a "word". Each time you read from memory, you read a "word". The size of the word is 4 bytes (32 bits) in IA32. - Aziz
@Aziz "word" is just an architectural term and it refers to the natural unit of data supported by the ISA. The unit of transfer between the CPU and memory depends on the hardware interface between them, which typically supports multiple units of transfer (typically 8-64 bytes). - Hadi Brais
"cpu just retrieve the byte at 0x100" - why do you even think that. You specified a starting address and a size (length) in your instruction, so the cpu will fetch the appropriate number of bytes. How that is implemented in hardware, is not usually of concern (it's not clear whether you asked about those details). - Jester
@amjad The number of bytes to fetch from memory is encoded in the instruction itself. When the CPU decodes the instruction, it determines how many bytes it is supposed to fetch and the address of the location to fetch from. - Hadi Brais
@old_timer but what if the variable is a char which only occupies one byte in memory? Or if the variable is a long int which takes 8 bytes, how does a 32-bit machine retrieve that, maybe using movl? But again, what about accessing an object which is 128 bytes? How does the CPU know in advance how many times it needs to issue a mov command to retrieve the complete object? - amjad

2 Answers

3
votes

how does a 32-bit machine retrieve a long integer which is 8 bytes (64 bits)?

If you're doing it in integer registers, the compiler has to use multiple instructions, because the architecture doesn't provide an instruction to load two 32-bit registers at once. So the CPU just sees two separate load instructions.

Consider these functions, compiled by gcc7.3 -O3 -m32 for 32-bit x86, with args passed on the stack, and 64-bit integers returned in edx:eax (high half in EDX, low half in EAX). i.e. the i386 System V ABI.

int64_t foo(int64_t a) {
    return a + 2;
}
    movl    4(%esp), %eax
    movl    8(%esp), %edx
    addl    $2, %eax
    adcl    $0, %edx                   # add-with-carry
    ret


int64_t bar(int64_t a, int64_t b) {
    return a + b;
}

    movl    12(%esp), %eax      # low half of b
    addl    4(%esp), %eax       # add low half of a
    movl    16(%esp), %edx
    adcl    8(%esp), %edx       # carry-in from low-half add
    ret

The CPU itself provides instructions that programmers / compilers can use when working with data larger than a register. The CPU only supports the widths that are part of the instruction set, not arbitrary widths. This is why we have software.

On x86, the compiler could instead have chosen to use movq into an XMM or MMX register, and used paddq, especially if this was part of a larger function that could store the 64-bit result somewhere in memory instead of needing it in integer registers. But this only works up to the limit of what you can do with vector registers, and they only support elements up to 64 bits wide. There's no 128-bit addition instruction.

how does the CPU know in advance how many times it needs to issue a mov command to retrieve a large object?

The CPU only has to execute every instruction exactly once, in program order. (Or do whatever it wants internally to give the illusion of doing this).

An x86 CPU has to know how to decode any possible x86 instruction into the right internal operations. If the CPU can only load 128 bits at a time, it has to decode a 256-bit vector load like vmovups (%edi), %ymm0 into multiple load operations internally (like AMD does). See David Kanter's write-up on the Bulldozer microarchitecture.

Or it could decode it to a special load operation that takes two cycles in the load port (like Sandybridge), so 256-bit loads/stores don't cost extra front-end bandwidth, only extra time in the load / store ports.

Or if its internal data path from L1d cache to execution units is wide enough (Haswell and later), it can decode to a single simple load uop that is handled internally by the cache / load port very much like mov (%edi), %eax, or especially vmovd (%edi), %xmm0 (a 32-bit zero-extending load into a vector register).

256 bytes is 32 qwords; no current x86 CPUs can load that much in a single operation.

256 bits is 4 qwords, or one AVX ymm register. Modern Intel CPUs (Haswell and later) have internal data paths that wide, and really can transfer 256 bits at once from cache to a vector load execution unit, executing vmovups (%rdi), %ymm0 as a single uop. See How can cache be that fast? for more details about how wide loads from cache give extremely high throughput / bandwidth for L1d cache.

1
vote

In general CPUs can load multiple bytes from memory because they are designed to do so and their ISA supports it.

For example, their registers, internal buses, cache design, and memory subsystem are designed to do so. Physically, a processor capable of loading 64-bit values may have 64 parallel wires in various places to move 64 bits (8 bytes) around the CPU - but other designs are possible, such as a narrower 16-bit bus that transfers two bytes at a time, or even a bit-serial point-to-point connection which transmits bits one at a time. Different parts of the same CPU may use different designs and different physical widths. For example, reading N bits from DRAM may be implemented as reading M bits in parallel from each of C chips, with the results merged at the memory controller, so the individual chips need to support a lesser degree of parallelism than other parts of the core-to-memory path.

The width inherently supported by the ISA may differ from the natural width implemented by the hardware. For example, when Intel added the AVX ISA extension, which was the first to support 256-bit (32-byte) loads and stores, the underlying hardware initially implemented this as a pair of 128-bit operations. Later CPU architectures (Haswell) finally implemented this as full 256-bit-width operations. Even today, lower-cost x86 chips may split up large load/store operations into smaller units.

Ultimately, these are all internal details of the CPU. What you can rely on is the documented behavior, such as what size of values can be loaded atomically, or, for CPUs that document it, how long loads of various sizes take. How it is implemented internally is more of an electrical engineering / CPU design question, and there are many ways to do it.