How does a 32-bit machine retrieve a long integer that is 8 bytes (64 bits)?
If you're doing it in integer registers, the compiler has to use multiple instructions, because the architecture doesn't provide an instruction to load two 32-bit registers at once. So the CPU just sees two separate load instructions.
Consider these functions, compiled with gcc 7.3 -O3 -m32 for 32-bit x86, with args passed on the stack and 64-bit integers returned in edx:eax (high half in EDX, low half in EAX), i.e. the i386 System V ABI.
int64_t foo(int64_t a) {
return a + 2;
}
movl 4(%esp), %eax
movl 8(%esp), %edx
addl $2, %eax
adcl $0, %edx # add-with-carry
ret
int64_t bar(int64_t a, int64_t b) {
return a + b;
}
movl 12(%esp), %eax # low half of b
addl 4(%esp), %eax # add low half of a
movl 16(%esp), %edx
adcl 8(%esp), %edx # carry-in from low-half add
ret
The CPU itself provides instructions that programmers / compilers can use when working with data larger than a register. The CPU only supports the widths that are part of the instruction set, not arbitrary widths. This is why we have software.
On x86, the compiler could instead have chosen to use movq into an XMM or MMX register, and used paddq, especially if this was part of a larger function that could store the 64-bit result somewhere in memory instead of needing it in integer registers. But this only works up to the limit of what you can do with vector registers, and they only support elements up to 64 bits wide. There's no 128-bit addition instruction.
How does the CPU know in advance how many times it needs to issue a mov instruction to retrieve a large object?
The CPU only has to execute every instruction exactly once, in program order. (Or do whatever it wants internally to give the illusion of doing this).
An x86 CPU has to know how to decode any possible x86 instruction into the right internal operations. If the CPU can only load 128 bits at a time, it has to decode a 256-bit vector load like vmovups (%edi), %ymm0 into multiple load operations internally (like AMD does). See David Kanter's write-up on the Bulldozer microarchitecture.
Or it could decode it to a special load operation that takes two cycles in the load port (like Sandybridge), so 256-bit loads/stores don't cost extra front-end bandwidth, only extra time in the load / store ports.
Or if its internal data path from L1d cache to execution units is wide enough (Haswell and later), it can decode to a single simple load uop that is handled internally by the cache / load port very much like mov (%edi), %eax, or especially vmovd (%edi), %xmm0 (a 32-bit zero-extending load into a vector register).
256 bytes is 32 qwords; no current x86 CPUs can load that much in a single operation.
256 bits is 4 qwords, or one AVX ymm register. Modern Intel CPUs (Haswell and later) have internal data paths that wide, and really can transfer 256 bits at once from cache to a vector load execution unit, executing vmovups ymm0, [rdi] as a single uop. See How can cache be that fast? for more details about how wide loads from cache give extremely high throughput / bandwidth for L1d cache.