Determine memory access time in X86 assembly

Question

I am trying to determine the time to access two memory addresses which are separated by a certain delta. My code has to be mixed x86 assembly and C and will run "bare metal" (without any OS; edit: I am actually modifying memtest) in order to get the most precise result.

I am more used to ARM assembly than x86, so I might have made some mistakes (and I am wondering why mov does so many different things in x86). My code so far is as follows.

inline unsigned timeread (ulong addr, ulong delta, int iter)
{
    ulong daddr;
    int i;
    ulong st_low, st_high;
    ulong end_low, end_high;

    daddr = addr + delta;

    asm __volatile__ ("rdtsc":"=a" (st_low),"=d" (st_high));

    for (i = 0; i < iter; ++i)
    {
        asm __volatile__ (
            "movl (%0), %%eax\n\t"
            :
            : "D" (addr)
            : "eax"
        );

        asm __volatile__ (
            "movl (%0), %%eax\n\t"
            :
            : "D" (daddr)
            : "eax"
        );
    }

    asm __volatile__ ("rdtsc":"=a" (end_low),"=d" (end_high));

    asm __volatile__ (
        "subl %2,%0\n\t"
        "sbbl %3,%1"
        : "=a" (end_low), "=d" (end_high)
        : "g" (st_low), "g" (st_high),
            "0" (end_low), "1" (end_high)
    );

    return end_low;
}

I am using gcc and compiling with the flags -march=i486 -m32.

EDIT: Before calling the function, I call a function provided by memtest, set_cache(0), in order to deactivate the cache (at least, that is what it says). Calling set_cache(1) instead reduces the execution time dramatically (from ~2000 cycles to <10 cycles). If some caching is still active, I suppose memtest didn't find a solution for this problem either.

EDIT: Asking the question usually helps in getting an answer...
Is the assembly code correct? I am quite surprised that mov is able to access RAM in x86, since in ARM you would use the specialised LDR instruction for this.

I'd dispute your claim that this is C. In C, integers and pointers are distinct. — EOF
This doesn't look very useful, what would actually happen is two memory accesses and then a ton of cache accesses. — harold
You don't actually ask a question, but as harold says, this doesn't seem to measure anything useful. It will also be difficult to execute this code on "bare metal" as GCC will generate 32-bit code. — Ross Ridge
@EOF : The code is based on memtest and I admit that there are some things that aren't particularly beautiful. — phexcaer
@harold: The cache is deactivated before the function is called. — phexcaer

Ross Ridge · Accepted Answer · 2015-06-03T15:48:34

Putting aside the question of whether your code actually measures what you want it to, your code is correct. The Intel x86 instruction set doesn't follow the same RISC principles used when the ARM instruction set was designed. With RISC there are separate instructions specifically for loading from and storing to memory. This is a departure from the earlier style of CPU design (retroactively named CISC) where most instructions could access memory directly: not just move instructions but also things like add and jump. So yes, your MOVL instructions load EAX with a 32-bit value from memory, as you intend.
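To illustrate the point that x86 instructions other than MOV can take a memory operand, here is a minimal sketch using GCC inline assembly. The function name add_from_memory is made up for this example and is not part of memtest or of your code.

```c
#include <stdint.h>

/* Hypothetical helper: adds the 32-bit value stored at *p to acc.
 * The "m" constraint makes the second operand a memory reference, so a
 * single ADDL reads memory directly; no separate load is needed. */
static inline uint32_t add_from_memory(uint32_t acc, const uint32_t *p)
{
    __asm__ ("addl %1, %0"
             : "+r" (acc)   /* read-write register operand */
             : "m" (*p)     /* memory source operand */
             : "cc");       /* ADD modifies the flags */
    return acc;
}
```

On ARM the equivalent would need an LDR into a register followed by an ADD between registers.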

Note that because you used the "D" constraint in both MOVL asm statements, the compiler has to reload EDI before each asm statement. Since there's no apparent reason to constrain the operand to EDI, I would recommend using the "r" constraint so the compiler can pick two different registers. When optimization is enabled, the compiler can then load these two registers just once outside the loop, rather than on every iteration. (If you don't use optimization, then either way the compiler will load the register from where addr and daddr are stored on the stack during every loop iteration.)

For example:

    asm volatile (
        "movl (%0), %%eax\n\t"
        :
        : "r" (addr)
        : "eax"
    );

    asm volatile (
        "movl (%0), %%eax\n\t"
        :
        : "r" (daddr)
        : "eax"
    );
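Putting the two statements back into the timing loop, the whole function might look roughly like the sketch below. This is only an illustration under a few assumptions: read_tsc and timeread64 are hypothetical names, the addresses are passed as pointers rather than ulong, and the 64-bit subtraction is done in C instead of with a SUBL/SBBL asm statement.

```c
#include <stdint.h>

/* Hypothetical helper: RDTSC returns the time-stamp counter in EDX:EAX. */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical variant of timeread(): returns the elapsed TSC ticks for
 * iter pairs of 32-bit loads from addr and addr + delta (delta in bytes). */
static inline uint64_t timeread64(const volatile uint32_t *addr,
                                  unsigned long delta, int iter)
{
    const volatile uint32_t *daddr =
        (const volatile uint32_t *)((const volatile char *)addr + delta);
    uint64_t start;
    int i;

    start = read_tsc();
    for (i = 0; i < iter; ++i) {
        /* "r" lets the compiler keep addr and daddr in two different
         * registers, so neither has to be reloaded inside the loop. */
        __asm__ __volatile__ ("movl (%0), %%eax" : : "r" (addr)  : "eax");
        __asm__ __volatile__ ("movl (%0), %%eax" : : "r" (daddr) : "eax");
    }
    return read_tsc() - start;
}
```

Returning the full 64-bit difference avoids truncating the result to its low 32 bits, as the return end_low in the original does.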
