
I'm developing a parallel algorithm on an Intel i5 machine, which has two cores and four threads.

n defines the size of the matrix on which I perform my calculations. As you can see from the table below, there is almost a 50% reduction in runtime going from 1 thread to 2 threads, but almost no difference between 2 threads and 4 threads. The numbers denote seconds elapsed.

# of threads - computation time (s)

My compiler is MinGW GCC on Windows. My parallelization tool is OpenMP. I set the number of threads with omp_set_num_threads(numThreads); at the beginning of the parallel routine.

I have no means to test the algorithm on a "real" 8-core machine. On my i5 machine, at 1 thread Task Manager shows 25% of the total CPU power in use; at 2 threads it's 50%, and at 4 threads it's 96-99%, as expected.

So what might be the reason for that situation? Why doesn't the computation time get halved?

The parallel code segment is shown below:

    #pragma omp parallel for schedule(guided) shared(L,A) \
    private(i)
    for (i = k+1; i < row; i++) {
        double dummy = 0;
        for (int nn = 0; nn < k; nn++)
            dummy += L[i][nn] * L[k][nn];
        /* write the result once, after the dot product is complete,
           rather than on every inner-loop iteration */
        L[i][k] = (A[i][k] - dummy) / L[k][k];
    }
1
You might be memory-limited. Or you might have too many dependencies between iterations. BTW, is this a Cholesky decomposition? – Oliver Charlesworth
Have you compared having independent/private arrays per thread vs. shared? – gbulmer
@Oli Charlesworth Yes, this is part of a Cholesky decomposition algorithm. But I don't think I'm memory-limited: when n=5000, the program uses 362 MB of my 4 GB capacity. – Emre Turkoz
Roughly: if the values are already in cache most of the time, then using two threads instead of one won't give much benefit; similarly, if the bottleneck is memory access, more threads won't help. Threads might help if there is spare memory bandwidth, and using the extra thread causes more values to get loaded into cache per unit time. – gbulmer
@Emre Turkoz - I think Oli is correct: there isn't enough computation happening, so you might be memory-bandwidth limited. But I was thinking of "splitting" the array into 4 parts, i.e. accessing widely separate parts, and using a thread on each. If there is plenty of memory bandwidth and it is CPU-bound (which I don't expect), then making the parts independent might give the hardware the chance to interleave memory loading better. I honestly think there is too little computation vs. memory access, so I don't expect much change, but it might be worth a little test. – gbulmer

1 Answer


Well, your machine has 2 cores and 4 threads.

First, you only have 2 physical cores, so you won't get a 4x speedup going from 1 to 4 threads; the two extra hardware threads share the execution units of the two cores.

Second, as you scale to more threads, you will likely start hitting resource contention, such as saturating your memory bandwidth.