I have a very simple piece of parallel code that I am using to learn OpenMP. It is embarrassingly parallel, but I don't get the linear (or even superlinear) speedup I expected.
#pragma omp parallel num_threads(cores)
{
    // Each thread multiplies its own pair of matrices.
    int id = omp_get_thread_num();
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, row, column, column, 1.0, MatrixA1[id], column, MatrixB[id], column, 0.0, Matrixmultiply[id], column);
}
In Visual Studio with the Intel C++ Compiler XE 15.0, computing sgemm (matrix multiplication) for 288 by 288 matrices, I get 350 microseconds for cores=1 and 1177 microseconds for cores=4, which looks just like sequential code. I set the Intel MKL property to Parallel (also tested with Sequential) and the Language settings to Generate Parallel Code (/Qopenmp). Is there any way to improve this? I am running on a quad-core Haswell processor.
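For reference, here is a minimal sketch of how a measurement like this could be set up so that one-time costs are kept out of the timing. It is not my exact code: `naive_sgemm`, `run_batch` and `time_batch_us` are hypothetical names, and `naive_sgemm` is only a stand-in for `cblas_sgemm` so that the sketch builds without MKL. It assumes square n x n row-major matrices, one warm-up batch before timing, and an average over several repetitions.

```cpp
#include <cassert>
#include <chrono>
#include <vector>

// Hypothetical stand-in for cblas_sgemm so the sketch builds without MKL:
// C = A * B for row-major n x n matrices.
void naive_sgemm(int n, const float* A, const float* B, float* C) {
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) C[i * n + j] = 0.0f;
        for (int k = 0; k < n; ++k) {
            const float a = A[i * n + k];
            for (int j = 0; j < n; ++j) C[i * n + j] += a * B[k * n + j];
        }
    }
}

// One batch: `cores` independent multiplications, one per loop iteration.
// With OpenMP enabled (/Qopenmp or -fopenmp) the iterations are spread over
// the threads; without it the pragma is ignored and the loop runs sequentially.
void run_batch(int cores, int n,
               const std::vector<std::vector<float>>& A,
               const std::vector<std::vector<float>>& B,
               std::vector<std::vector<float>>& C) {
    #pragma omp parallel for num_threads(cores)
    for (int id = 0; id < cores; ++id)
        naive_sgemm(n, A[id].data(), B[id].data(), C[id].data());
}

// Returns the average wall-clock time of one batch in microseconds.
double time_batch_us(int cores, int n, int reps,
                     const std::vector<std::vector<float>>& A,
                     const std::vector<std::vector<float>>& B,
                     std::vector<std::vector<float>>& C) {
    assert(cores > 0 && reps > 0);
    assert(static_cast<int>(A.size()) >= cores);
    assert(static_cast<int>(B.size()) >= cores);
    assert(static_cast<int>(C.size()) >= cores);

    // Warm-up (assumption: the first parallel region typically also pays for
    // creating the worker threads, which would otherwise be counted).
    run_batch(cores, n, A, B, C);

    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r) run_batch(cores, n, A, B, C);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::micro>(stop - start).count() / reps;
}
```

Calling `time_batch_us(1, 288, 100, A, B, C)` and `time_batch_us(4, 288, 100, A, B, C)` with the same buffers should then give per-batch times that can be compared directly. If the work really runs in parallel, the four-matrix batch should take about as long as the single-matrix one.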