Root-cause of the [segfault] is not OpenMP, but huge STACK allocations
You might have noticed, that the 900 was not a fixed treshold for going into [segfault] crashes.
Yet, lets show one of the several ways forward: put the data on HEAP, instead of leting it get allocated on STACK:
The TiO.RUN online run-verifiable demo is ready to click and run here ( Array sizes above ~ 8000 x 8000 x 8-B x 3-matrices start to get soft-killed by the platform, as runtimes grow beyond the online-platform demo quota of about 1 minute CPU-time ).
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <sys/time.h>
#define cTAM 1234
static int m3[cTAM][cTAM], // HEAP-allocated data does not devastate STACK-resources
m2[cTAM][cTAM], // HEAP-allocated data does not devastate STACK-resources
m1[cTAM][cTAM]; // HEAP-allocated data does not devastate STACK-resources
int main(int argc, char* argv[])
{
int tam = cTAM; // proxy for not accessible command-line parameter passing
// int first[ tam][tam];
// int second[tam][tam];
// int third[ tam][tam];
int i,j,k,l,f;
srand(time(NULL));
// ------------------------------------ // root-cause problem is NOT related to the art of OpenMP
// omp_set_num_threads(omp_get_num_procs());
for ( i = 0; i < tam; i++ )
for ( j = 0; j < tam; j++ )
{
l = rand();
f = rand();
// first[i][j] = l;
// second[i][j] = f;
m1[i][j] = l;
m2[i][j] = f;
}
// ------------------------------------ // root-cause problem is NOT related to the art of OpenMP
// #pragma omp parallel for private(i,j,k) shared(first,second,third)
for ( i = 0; i < tam; ++i ) {
for ( j = 0; j < tam; ++j ) {
for ( k = 0; k < tam; ++k ) {
// third[i][j] += first[i][k] * second[k][j];
m3[i][j] += ( m1[i][k]
+ m2[ k][j]
);
}
}
}
return( 1 );
}
The further OpenMP tricks are another subject, not related to [segfaults]
- refactor the indexing, so that better cache-line aligned processing re-uses row-wise data-elements from already fetched cache-line "neighbours" and best if shared access to third[][]-cells is completely avoided ( performance boosts will reward these efforts )
- performance will get boosted, if
first[][] ( or m1[][] above ) will become transposed, as the iteration will follow the row-wise memory-blocks already pre-fetched into cache-line ( data-access costs will drop from about ~1E2 [ns] down to ~0.5 [ns] on cache-hits, i.e. ~ 200 X )
Do not hesitate to read more and learn the needed technical details about cache-hierarchies, coalescing memory-access-patterns and about smart cache-line tricks to increase the cache-line re-use, instead of re-fetching the data that was once already fetched just by a poor order of index iterations. You had the same problem with the GPU-kernel code, so indeed worth to spend a few days on deep dive into this technology trick, as it works for you.
third[i][j]which is never to the sameiby any two threads at the same time. - Zulanthird[i][j] += ...will race. - Henri Menke