I have following working program which is producing results correctly however I am confused by some statistics. The setup is as:
- Hardware: Intel Xeon Phi processor 7210
- Software: Multiplication of two NxN matrices (in my case 512x512)
- Data Structures: All 3 matrices are malloc'ed in high bandwidth memory (i.e. in 16GB mcdram)
The code is:
void MatrixMultiply(img_in1, img_in2, img_out, myRank, nRanks)
{
for (int i=startingRow ; i<endingRow ; i++) //no of rows of first matrix divided on per process basis
{
for (int j=0 ; j<M1Cdim ; j++) //no of cols of first matrix
{
int temp = 0;
for(int k=0 ; k<M2Rdim ; k++) //no of rows of second matrix
{
temp += in1[i*M1Rdim+k] * in2[k*M2Rdim+j]; //out(i,j) += in1(i,k) * in2(k,j)
}
out[i*M1Rdim+j] = temp;
}
}
}
main()
{
for (int i = 1; i <= 10; i++)
{
MPI_Barrier(MPI_COMM_WORLD);
const double t0 = omp_get_wtime();
MatrixMultiply(img_in1, img_in2, img_out, myRank, nRanks);
MPI_Barrier(MPI_COMM_WORLD);
const double t1 = omp_get_wtime();
const double ts = t1-t0; // time in seconds
const double tms = ts*1.0e3; // time in milliseconds
const double gbps = double(sizeof(P)*2*img_in1.height*img_in1.width*img_in2.height)*1e-9/ts; // bandwidth in GB/s
const double fpps = double(2*img_in1.height*img_in1.width*img_in2.height)*1e-9/ts; // performance in GFLOP/s
if (myRank == 0)
{
printf("%5d %15.3f %15.3f %15.3f %s\n", i, tms, gbps, fpps);
}
}
}
The statistics I am getting are:
Step Time, ms GB/s GFLOP/s
1 2.306 116.408 116.408
2 2.334 115.017 115.017
3 2.297 116.855 116.855
4 2.295 116.964 116.964
5 16.692 16.082 16.082
6 11.468 23.407 23.407
7 2.299 116.758 116.758
8 2.291 117.171 117.171
9 2.295 116.964 116.964
10 10.792 24.874 24.874
So my question is:
Why iteration 5, 6 and 10 are showing worse results than other iterations?
My suspicion is that even though the data is placed in high bandwidth memory (mcdram) however code itself is executing from cache so might be getting hit. Although the overall program is quite small like 54KB but if running on shared server then some iterations might be evicted from instruction cache resulting in this performance penalty.