Speed of memcpy() greatly influenced by different ways of malloc()

Question

I wrote a program to test the speed of memcpy(). However, the way the memory is allocated greatly influences the speed.

CODE

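#include <stdio.h>
#include <stdlib.h>
#include <string.h>   /* memcpy */

int main(int argc, char *argv[]){
    unsigned char *pbuff_1;
    unsigned char *pbuff_2;
    unsigned long iters = 1000*1000;

    if(argc < 3){
        fprintf(stderr, "usage: %s <type> <size in KB>\n", argv[0]);
        return 1;
    }

    int type = atoi(argv[1]);
    size_t buff_size = (size_t)atoi(argv[2])*1024;

    if(type == 1){
        /* one allocation, split into two adjacent buffers */
        pbuff_1 = malloc(2*buff_size);
        pbuff_2 = pbuff_1 ? pbuff_1+buff_size : NULL;
    }else{
        /* two independent allocations */
        pbuff_1 = malloc(buff_size);
        pbuff_2 = malloc(buff_size);
    }

    if(pbuff_1 == NULL || pbuff_2 == NULL){
        fprintf(stderr, "malloc failed\n");
        return 1;
    }

    for(unsigned long i = 0; i < iters; ++i){
        memcpy(pbuff_2, pbuff_1, buff_size);
    }

    /* in the type == 1 case pbuff_2 points into pbuff_1's block */
    free(pbuff_1);
    if(type != 1){
        free(pbuff_2);
    }
    return 0;
}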

The OS is Linux 2.6.35 and the compiler is GCC 4.4.5 with the options "-std=c99 -O3".

Results on my computer (memcpy of 4KB, 1 million iterations):

time ./test.test 1 4

real    0m0.128s
user    0m0.120s
sys     0m0.000s

time ./test.test 0 4

real    0m0.422s
user    0m0.420s
sys     0m0.000s

This question is related to a previous question:

Why does the speed of memcpy() drop dramatically every 4KB?

UPDATE

The cause appears to be related to the GCC version. I compiled and ran this program with different versions of GCC:

GCC version      4.1.3       4.4.5       4.6.3
Time used (1)    0m0.183s    0m0.128s    0m0.110s
Time used (0)    0m1.788s    0m0.422s    0m0.108s

It seems GCC is getting smarter.

Strange, I couldn't reproduce this result on gcc 4.6.3 (I'm getting ~0.400 in both cases). Are you sure you didn't switch the args and run over 1k? — Leeor
@Leeor I am sure it ran over 4KB, and the same result recurs only at sizes of 4*i KB, on my laptop as well as on a server. Both CPUs have a 32KB, 8-way associative L1d cache. — foool
@Leeor I guess it's related to the GCC compiler. When I updated GCC to version 4.6.3, the times of the two cases became the same. — foool
Very difficult to believe that "time ./test 0 4" is taking less time than "time ./test.test 1 4". My test shows otherwise with gcc 4.6.3. — Vikram Singh

Peter G. · Accepted Answer · 2014-01-13T12:27:04

The specific addresses returned by malloc are selected by the implementation and are not always optimal for the code that uses them. You already know that the speed of moving memory around depends greatly on cache and page effects.

Here, the specific pointers returned by malloc are not known. You could print them out using printf("%p", ptr). What is known, however, is that using just one malloc for both blocks surely avoids page and cache waste between the two blocks. That may already be the reason for the speed difference.
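
As a minimal sketch of the printf("%p", ptr) suggestion, the helpers below print a source/destination pair together with each address's offset inside its page and the distance between the two. The names page_offset and report_buffers are hypothetical, not part of the question's program, and the page size is passed in by the caller rather than assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Offset of an address within its page.
 * page_size must be a non-zero power of two (caller-supplied assumption). */
size_t page_offset(const void *p, size_t page_size)
{
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return (size_t)((uintptr_t)p & (page_size - 1));
}

/* Print both pointers, their page offsets, and the distance between them. */
void report_buffers(const void *src, const void *dst, size_t page_size)
{
    printf("src = %p (page offset %zu)\n", (void *)src, page_offset(src, page_size));
    printf("dst = %p (page offset %zu)\n", (void *)dst, page_offset(dst, page_size));
    printf("dst - src = %lld bytes\n",
           (long long)((intptr_t)dst - (intptr_t)src));
}
```

Calling report_buffers(pbuff_1, pbuff_2, 4096) right after the allocations, once for each value of type, would show how the two layouts differ. The 4096 here is an assumed page size; on Linux, sysconf(_SC_PAGESIZE) reports the actual value.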
