I'm observing a strange behavior and would like to know if it is Intel Xeon Phi related or not.
I have a small example code, basically the matrix multiplication everyone knows (three nested for loops). I offload the computation to an Intel MIC with the OpenMP 4.0 target pragma and map the three matrices with map(to:A,B) map(tofrom:C).
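For reference, here is a minimal sketch of the kind of kernel I mean. It is not my exact code: the function name matmul_offload, the float element type, the row-major pointer layout with explicit array sections, and the accumulation into C are assumptions made for illustration. Without an OpenMP offload-capable compiler or device, the pragmas are ignored or the region falls back to the host.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (name and signature are illustrative only):
   computes C += A * B for n x n row-major float matrices.
   A and B are copied to the device, C is copied in both directions. */
void matmul_offload(const float *A, const float *B, float *C, int n)
{
    assert(n > 0);
    size_t len = (size_t)n * (size_t)n;

    /* Array sections are needed here because A, B and C are pointers. */
    #pragma omp target map(to: A[0:len], B[0:len]) map(tofrom: C[0:len])
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            float sum = C[(size_t)i * n + j];
            for (int k = 0; k < n; ++k) {
                sum += A[(size_t)i * n + k] * B[(size_t)k * n + j];
            }
            C[(size_t)i * n + j] = sum;
        }
    }
}
```

The native version would be the same function with the target pragma removed, so only the map clauses differ between the two runs.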
Now, what I am observing is that for small matrices, e.g. 1024x1024, the memory transfer takes extremely long. Compared to the native version (same code, same parallelisation strategy, just no offloading), the offload version takes about 320 ms longer. I did a warm-up run of the code beforehand to exclude initialization overhead.
On an Nvidia Tesla K20, the same amount of memory is copied without any noticeable delay, so by comparison these 320 ms are very bad.
Are there some environment settings that may improve the memory transfer speed?
An additional question: I enabled offload reporting via the OFFLOAD_REPORT environment variable. What is the difference between the two timing results shown in the report?
[Offload] [HOST] [Tag 5] [CPU Time] 26.995279(seconds)
[Offload] [MIC 0] [Tag 5] [CPU->MIC Data] 3221225480 (bytes)
[Offload] [MIC 0] [Tag 5] [MIC Time] 16.859548(seconds)
[Offload] [MIC 0] [Tag 5] [MIC->CPU Data] 1073741824 (bytes)
What accounts for the roughly 10 seconds by which MIC Time is shorter than CPU Time (memory transfer?)
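To make the numbers concrete, here is a small sketch of the arithmetic on the report values. The function names are made up for illustration, and attributing the whole gap to data transfer is only my guess; the report itself does not say so.

```c
/* Values copied from the OFFLOAD_REPORT output above. */
static const double cpu_time_s = 26.995279;
static const double mic_time_s = 16.859548;
static const unsigned long long bytes_to_mic = 3221225480ULL;   /* CPU->MIC */
static const unsigned long long bytes_from_mic = 1073741824ULL; /* MIC->CPU */

/* Time seen on the host but not accounted for by MIC Time. */
double report_gap_seconds(void)
{
    return cpu_time_s - mic_time_s;
}

/* Total bytes moved in both directions. */
unsigned long long report_total_bytes(void)
{
    return bytes_to_mic + bytes_from_mic;
}

/* ASSUMPTION: the entire gap is transfer time. Under that assumption,
   this is the implied average transfer rate in GB/s (1 GB = 1e9 bytes). */
double implied_rate_gb_per_s(void)
{
    return (double)report_total_bytes() / report_gap_seconds() / 1e9;
}
```

The gap comes out at about 10.14 seconds for about 4.29 GB moved in total, so if the gap really were pure transfer, the rate would be only about 0.42 GB/s.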
And a third question: is it possible to use pinned memory with Intel MICs? If yes, how?