I have a function defined as
inline void vec_add(__m512d &v3, const __m512d &v1, const __m512d &v2) {
    v3 = _mm512_add_pd(v1, v2);
}
(__m512d is a native data type that maps to SIMD registers on the Intel MIC architecture.)
Since this function is short and called frequently, I'd like it to be inlined at every call site. However, Intel's compiler seems reluctant to inline it, even with the -inline-forceinline and -O3 options. It reports 'Forceinline not honored for call ...' while compiling. Because I have to use some compiler-specific features, e.g. the __m512d type, the Intel compiler is my only option.
More Info:
The file structure is quite simple. The function vec_add is defined in a header file, mic.h, which is included in another file, test.cc. vec_add is simply invoked repeatedly in a loop, and no function pointers are involved. A simplified version of the code in test.cc looks like this:
for (int i = 0; i < LENGTH; i += 8) {
    // a, b, c are arrays of doubles, and each SIMD register can hold 8 doubles
    __m512d va = _mm512_load_pd(a + i); // load SIMD register from memory
    __m512d vb = _mm512_load_pd(b + i); // ditto
    __m512d vc;
    vec_add(vc, va, vb);                // the call that should be inlined
    _mm512_store_pd(c + i, vc);         // store SIMD register to memory
}
I've tried all kinds of hints, such as __attribute__((always_inline)), __forceinline, and the compiler option -inline-forceinline, but none of them has worked so far.
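For illustration, here is a minimal sketch of how a force-inline hint is typically attached to a definition like vec_add. It is not the MIC code itself: vec8d is a hypothetical plain struct standing in for __m512d so that the example compiles with any C++11 compiler, and FORCE_INLINE and add_arrays are illustrative names that do not appear in my code. Whether the hint is actually honored still depends on the compiler.

```cpp
#include <cstddef>

// Portable spelling of a "force inline" hint (FORCE_INLINE is a hypothetical name).
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__) || defined(__INTEL_COMPILER)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline
#endif

// Hypothetical stand-in for __m512d: eight doubles, 64-byte aligned.
struct alignas(64) vec8d {
    double lane[8];
};

// Same shape as vec_add in the question: output by reference,
// inputs by const reference.
FORCE_INLINE void vec_add(vec8d &v3, const vec8d &v1, const vec8d &v2) {
    for (std::size_t k = 0; k < 8; ++k) {
        v3.lane[k] = v1.lane[k] + v2.lane[k];
    }
}

// Adds two arrays eight elements at a time, mirroring the loop in test.cc.
// Assumption: len is a multiple of 8.
void add_arrays(double *c, const double *a, const double *b, std::size_t len) {
    for (std::size_t i = 0; i < len; i += 8) {
        vec8d va, vb, vc;
        for (std::size_t k = 0; k < 8; ++k) {   // "load" from memory
            va.lane[k] = a[i + k];
            vb.lane[k] = b[i + k];
        }
        vec_add(vc, va, vb);                    // the call we want inlined
        for (std::size_t k = 0; k < 8; ++k) {   // "store" to memory
            c[i + k] = vc.lane[k];
        }
    }
}
```

Calling add_arrays(c, a, b, n) with n a multiple of 8 fills c with the element-wise sums of a and b; the interesting part is only where the hint sits on the vec_add definition.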
Complete code
I've put all the relevant code together in a simplified form. You can try it out if you have an Intel compiler. Use the option -Winline to view inline reports and -inline-forceinline to force inlining.
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>

#define LEN (1<<20)

__attribute((target(mic)))
inline void vec_add(__m512d &v3, const __m512d &v1, const __m512d &v2) {
    v3 = _mm512_add_pd(v1, v2);
}

int main() {
    #pragma offload target(mic)
    {
        double *a = (double*)_mm_malloc(LEN*sizeof(double), 64);
        double *b = (double*)_mm_malloc(LEN*sizeof(double), 64);
        double *c = (double*)_mm_malloc(LEN*sizeof(double), 64);

        for (int i = 0; i < LEN; i++) {
            a[i] = (double)rand()/RAND_MAX;
            b[i] = (double)rand()/RAND_MAX;
        }

        for (int i = 0; i < LEN; i += 8) {
            __m512d va = _mm512_load_pd(a + i);
            __m512d vb = _mm512_load_pd(b + i);
            __m512d vc;
            vec_add(vc, va, vb);
            _mm512_store_pd(c + i, vc);
        }

        _mm_free(a);
        _mm_free(b);
        _mm_free(c);
    }
}
Configurations
- Compiler: Intel compiler (ICC) 14.0.2
- Compile options: -O3 -inline-forceinline -Winline
Do you have any idea why this function can't be inlined? And how can I get it inlined after all (I'd rather not resort to macros)?