What I don't understand is why we have to align data in memory on boundaries larger than 4 bytes, given that all the larger boundaries are multiples of 4. Assuming a CPU can read 4 bytes in a cycle, there should be basically no difference in performance whether an 8-byte value is aligned on a 4-byte, 8-byte, or 16-byte boundary.
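For example, this is roughly what I mean (a quick C sketch; struct S is just something I made up):

#include <stdio.h>
#include <stddef.h>

struct S {
    char   c;   /* 1 byte */
    double d;   /* 8 bytes; the compiler pads after c so that d starts on an 8-byte boundary */
};

int main(void) {
    printf("_Alignof(double) = %zu\n", _Alignof(double));       /* typically 8 on x86-64 */
    printf("offsetof(S, d)   = %zu\n", offsetof(struct S, d));  /* typically 8, not 4 */
    return 0;
}

If the CPU reads 4 bytes per cycle anyway, placing d at offset 4 instead of 8 doesn't look like it would cost anything.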
2 Answers
First: x86 CPUs don't read data in 4-byte chunks only; they can read 8 bytes per cycle, or even more with SIMD extensions.
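SSE is a concrete example of that: it has both aligned and unaligned 16-byte loads, and the aligned one faults if the address is not a multiple of 16. A minimal sketch (C with x86 SSE intrinsics, buffer name made up):

#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* C11 aligned_alloc: 16-byte aligned buffer, as required by the aligned load below */
    float *buf = aligned_alloc(16, 8 * sizeof(float));
    if (!buf) return 1;
    for (int i = 0; i < 8; i++) buf[i] = (float)i;

    __m128 a = _mm_load_ps(buf);       /* aligned load: address must be a multiple of 16 */
    __m128 b = _mm_loadu_ps(buf + 1);  /* unaligned load: any address, possibly slower */
    /* _mm_load_ps(buf + 1) would fault: that address is only 4-byte aligned */

    float out[4];
    _mm_storeu_ps(out, _mm_add_ps(a, b));
    printf("%f\n", out[0]);

    free(buf);
    return 0;
}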
But to answer your question "why are there alignment boundaries larger than 4?", and assuming a generic architecture (you didn't specify one, and you wrote that x86 was just an example), I'll present a specific case: GPUs.
NVIDIA GPU memory can only be accessed (load/store) at an address that is a multiple of the access size (see the PTX ISA documentation for ld/st). There are different kinds of loads, and the most performant ones require the address to be aligned to a multiple of the access size, so if you're trying to load a double (8 bytes) from memory you would have (PTX-style pseudocode):
ld.global.f64  %fd0, [48];   // works: 48 is a multiple of 8
ld.global.f64  %fd0, [17];   // fails: 17 is not 8-byte aligned
In the above case, trying to access (read/write) memory that is not properly aligned will actually cause an error. If you want speed, you'll have to provide the hardware with some safety guarantees.
That might answer your question as to why alignment boundaries larger than 4 exist in the first place. On such an architecture an access size of 1 is always safe (every address is a multiple of 1), but the same isn't true for any access size n > 1: only every n-th address is suitably aligned.
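To make the rule concrete, the check such hardware applies boils down to something like this (a small C sketch reusing the addresses from the pseudocode above):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* An address is valid for an access of `size` bytes iff it is a multiple of `size`,
   so size 1 is accepted everywhere while larger sizes are not. */
static bool is_aligned(size_t addr, size_t size) {
    return addr % size == 0;
}

int main(void) {
    printf("8-byte access at 48: %s\n", is_aligned(48, 8) ? "ok" : "fault"); /* ok */
    printf("8-byte access at 17: %s\n", is_aligned(17, 8) ? "ok" : "fault"); /* fault */
    printf("1-byte access at 17: %s\n", is_aligned(17, 1) ? "ok" : "fault"); /* ok */
    return 0;
}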