1
votes

I'm currently working in a program that multiplies matrices in c, it receives the size of the matrices in the same line of execution,the program works for matrices below size 900 but when reaching matrices of more than size 900 I'm receiving the error segmentation fault(core dumped)

After checking many ressources available on internet I'm still unable to solve the problem, here is the code that I'm using:

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <sys/time.h>

int main(int argc, char* argv[]) 
{
    int tam = atoi(argv[1]);
    int first[tam][tam];
    int second[tam][tam];
    int third[tam][tam];
    int i,j,k,l,f;
    srand(time(NULL));
    omp_set_num_threads(omp_get_num_procs());
    for (i= 0; i< tam; i++)
        for (j= 0; j< tam; j++)
    {
            l = rand();
            f = rand();
            first[i][j] = l;
            second[i][j] = f;
    }
    #pragma omp parallel for private(i,j,k) shared(first,second,third)
    for (i = 0; i < tam; ++i) {
        for (j = 0; j < tam; ++j) {
            for (k = 0; k < tam; ++k) {
                third[i][j] += first[i][k] * second[k][j];
            }
        }
    }

}

I would really appreciate any help and thank you for taking the time to read my question.

1
Have you used a debugger to see exactly where it seg faults? I'm not sure, but maybe OpenMP doesn't like variable length arrays? - simon
You have a race condition. - Henri Menke
@HenriMenke where is there a race condition? I don't see one. The only write is to third[i][j] which is never to the same i by any two threads at the same time. - Zulan
@Zulan Okay, I'm actually not entirely sure but if OpenMP collapses the three nested loops (which I don't know if it happens by default), the third[i][j] += ... will race. - Henri Menke
@HenriMenke OpenMP does not collapse loops by default. - Zulan

1 Answers

0
votes

Root-cause of the [segfault] is not OpenMP, but huge STACK allocations

You might have noticed, that the 900 was not a fixed treshold for going into [segfault] crashes.

Yet, lets show one of the several ways forward: put the data on HEAP, instead of leting it get allocated on STACK:

The TiO.RUN online run-verifiable demo is ready to click and run here ( Array sizes above ~ 8000 x 8000 x 8-B x 3-matrices start to get soft-killed by the platform, as runtimes grow beyond the online-platform demo quota of about 1 minute CPU-time ).

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <sys/time.h>

#define             cTAM 1234
static int m3[cTAM][cTAM],              // HEAP-allocated data does not devastate STACK-resources
           m2[cTAM][cTAM],              // HEAP-allocated data does not devastate STACK-resources
           m1[cTAM][cTAM];              // HEAP-allocated data does not devastate STACK-resources

int main(int argc, char* argv[]) 
{
    int             tam = cTAM;         // proxy for not accessible command-line parameter passing
//  int first[ tam][tam];
//  int second[tam][tam];
//  int third[ tam][tam];

    int i,j,k,l,f;

    srand(time(NULL));
// ------------------------------------ // root-cause problem is NOT related to the art of OpenMP
//  omp_set_num_threads(omp_get_num_procs());
    for     ( i = 0; i < tam; i++ )
        for ( j = 0; j < tam; j++ )
    {
            l = rand();
            f = rand();
        //  first[i][j] = l;
        //  second[i][j] = f;
            m1[i][j] = l;
            m2[i][j] = f;
    }
// ------------------------------------ // root-cause problem is NOT related to the art of OpenMP
//  #pragma omp parallel for private(i,j,k) shared(first,second,third)
    for         ( i = 0; i < tam; ++i ) {
        for     ( j = 0; j < tam; ++j ) {
            for ( k = 0; k < tam; ++k ) {
            //   third[i][j] += first[i][k] * second[k][j];
                 m3[i][j]    += ( m1[i][k]
                                + m2[   k][j]
                                  );
            }
        }
    }
    return( 1 );
}

The further OpenMP tricks are another subject, not related to [segfaults]

  • refactor the indexing, so that better cache-line aligned processing re-uses row-wise data-elements from already fetched cache-line "neighbours" and best if shared access to third[][]-cells is completely avoided ( performance boosts will reward these efforts )
  • performance will get boosted, if first[][] ( or m1[][] above ) will become transposed, as the iteration will follow the row-wise memory-blocks already pre-fetched into cache-line ( data-access costs will drop from about ~1E2 [ns] down to ~0.5 [ns] on cache-hits, i.e. ~ 200 X )

Do not hesitate to read more and learn the needed technical details about cache-hierarchies, coalescing memory-access-patterns and about smart cache-line tricks to increase the cache-line re-use, instead of re-fetching the data that was once already fetched just by a poor order of index iterations. You had the same problem with the GPU-kernel code, so indeed worth to spend a few days on deep dive into this technology trick, as it works for you.