What I don't understand is why we have to align data in memory on boundaries larger than 4 bytes, given that all the larger boundaries are multiples of 4. Assuming a CPU can read 4 bytes in a cycle, there should be basically no difference in performance whether an 8-byte value is aligned on a 4-byte, 8-byte, or 16-byte boundary.
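For example, this is roughly what I mean (a quick C sketch; struct S is just something I made up):

#include <stdio.h>
#include <stddef.h>

struct S {
    char   c;   /* 1 byte */
    double d;   /* 8 bytes; the compiler pads after c so that d starts on an 8-byte boundary */
};

int main(void) {
    printf("_Alignof(double) = %zu\n", _Alignof(double));       /* typically 8 on x86-64 */
    printf("offsetof(S, d)   = %zu\n", offsetof(struct S, d));  /* typically 8, not 4 */
    return 0;
}

If the CPU reads 4 bytes per cycle anyway, placing d at offset 4 instead of 8 doesn't look like it would cost anything.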
2 Answers
First: x86 CPUs don't read data in 4-byte chunks only; they can read 8 bytes per cycle, or even more with SIMD extensions.
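SSE is a concrete example of that: it has both aligned and unaligned 16-byte loads, and the aligned one faults if the address is not a multiple of 16. A minimal sketch (C with x86 SSE intrinsics, buffer name made up):

#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* C11 aligned_alloc: 16-byte aligned buffer, as required by the aligned load below */
    float *buf = aligned_alloc(16, 8 * sizeof(float));
    if (!buf) return 1;
    for (int i = 0; i < 8; i++) buf[i] = (float)i;

    __m128 a = _mm_load_ps(buf);       /* aligned load: address must be a multiple of 16 */
    __m128 b = _mm_loadu_ps(buf + 1);  /* unaligned load: any address, possibly slower */
    /* _mm_load_ps(buf + 1) would fault: that address is only 4-byte aligned */

    float out[4];
    _mm_storeu_ps(out, _mm_add_ps(a, b));
    printf("%f\n", out[0]);

    free(buf);
    return 0;
}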
But to answer your question "why are there alignment boundaries larger than 4?", and assuming a generic architecture (you didn't specify one, and you wrote that x86 was just an example), I'll present a specific case: GPUs.
NVIDIA GPU memory can only be accessed (load/store) at an address that is a multiple of the access size (see the PTX ISA documentation for ld/st). There are different kinds of loads, and the most performant ones require the address to be aligned to a multiple of the access size, so if you're trying to load a double (8 bytes) from memory you would have (PTX-style pseudocode):
ld.global.f64  %fd0, [48];   // works: 48 is a multiple of 8
ld.global.f64  %fd0, [17];   // fails: 17 is not 8-byte aligned
In the above case, trying to access (read/write) memory that is not properly aligned will actually cause an error. If you want speed, you'll have to provide the hardware with some safety guarantees.
That might answer your question as to why alignment boundaries larger than 4 exist in the first place. On such an architecture an access size of 1 is always safe (every address is a multiple of 1), but the same isn't true for any access size n > 1: only every n-th address is suitably aligned.
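To make the rule concrete, the check such hardware applies boils down to something like this (a small C sketch reusing the addresses from the pseudocode above):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* An address is valid for an access of `size` bytes iff it is a multiple of `size`,
   so size 1 is accepted everywhere while larger sizes are not. */
static bool is_aligned(size_t addr, size_t size) {
    return addr % size == 0;
}

int main(void) {
    printf("8-byte access at 48: %s\n", is_aligned(48, 8) ? "ok" : "fault"); /* ok */
    printf("8-byte access at 17: %s\n", is_aligned(17, 8) ? "ok" : "fault"); /* fault */
    printf("1-byte access at 17: %s\n", is_aligned(17, 1) ? "ok" : "fault"); /* ok */
    return 0;
}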