12
votes

I have a function defined as

inline void vec_add(__m512d &v3, const __m512d &v1, const __m512d &v2) {
    v3 = _mm512_add_pd(v1, v2);
}

(__m512d is a native data type that maps to SIMD registers on the Intel MIC architecture.)

As this function is short and called frequently, I'd like it to be inlined at every call site. But Intel's compiler seems reluctant to inline it, even with the -inline-forceinline and -O3 options. It reports 'Forceinline not honored for call ...' while compiling. As I have to use some compiler-specific features, e.g. the __m512d type, the Intel compiler is my only option.

More Info:

The file structure is quite simple. The function vec_add is defined in a header file, mic.h, which is included in another file, test.cc. vec_add is simply invoked repeatedly in a loop, and no function pointers are involved. A simplified version of the code in test.cc looks like this:

for (int i = 0; i < LENGTH; i += 8) {
    // a, b, c are arrays of doubles, and each SIMD register can hold 8 doubles
    __m512d va = _mm512_load_pd(a + i); // load SIMD register from memory
    __m512d vb = _mm512_load_pd(b + i); // ditto
    __m512d vc;
    vec_add(vc, va, vb);                // vc = va + vb
    _mm512_store_pd(c + i, vc);         // store SIMD register to memory
}

I've tried all kinds of hints, such as __attribute__((always_inline)), __forceinline, and the compiler option -inline-forceinline, none of which has worked so far.

Complete code

I've put all the relevant code together in a simplified form. You can try it out if you have an Intel compiler. Use the option -Winline to view inlining reports and -inline-forceinline to force inlining.

#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>

#define LEN (1<<20)

__attribute((target(mic)))
inline void vec_add(__m512d &v3, const __m512d &v1, const __m512d &v2) {
    v3 = _mm512_add_pd(v1, v2);
}

int main() {
    #pragma offload target(mic)
    {
        double *a = (double*)_mm_malloc(LEN*sizeof(double), 64);
        double *b = (double*)_mm_malloc(LEN*sizeof(double), 64);
        double *c = (double*)_mm_malloc(LEN*sizeof(double), 64);

        for (int i = 0; i < LEN; i++) {
            a[i] = (double)rand()/RAND_MAX;
            b[i] = (double)rand()/RAND_MAX;
        }

        for (int i = 0; i < LEN; i += 8) {
            __m512d va = _mm512_load_pd(a + i);
            __m512d vb = _mm512_load_pd(b + i);
            __m512d vc;
            vec_add(vc, va, vb);
            _mm512_store_pd(c + i, vc);
        }

        _mm_free(a);
        _mm_free(b);
        _mm_free(c);
    }
}

Configurations

  • Compiler: Intel compiler (ICC) 14.0.2
  • Compile options: -O3 -inline-forceinline -Winline

Do you have any idea why this function can't be inlined? And how can I get it inlined after all? (I don't want to resort to macros.)

Are you perchance taking the address of the function somewhere? – Frédéric Hamidi
Do you call the function in the same module? – urzeit
Have you checked the assembly code to see whether there is really a jump to your function? – MikeMB
@MikeMB No, I haven't checked the assembly. But I've tried converting this function to a macro, and got a noticeable performance boost. So I'm fairly sure the function is not inlined. – lei_z
@lei.april That sounds reasonable, which unfortunately means I have no idea why the compiler doesn't want to inline the function. However, as you're already using compiler-specific types in the function interface, I wonder why you want to wrap the call to _mm512_add_pd in a function in the first place. – MikeMB

1 Answer

9
votes

For some reason the Intel compiler does not inline functions that are called directly in offloaded code (I'm not very familiar with offloading, so I don't know the technical reason). See effective-use-of-the-intel-compilers-offload-features for more information (just search for "inline").

Quoting from the linked article:

Function Inlining into Offload Constructs

Sometimes inlining a function is necessary for optimum performance of the generated code. Functions called directly within a #pragma offload are not inlined by the compiler even if they are marked as inline. To enable optimum performance of code in offload regions, either manually inline functions, or place the entire offload construct into its own function.

...

One solution is to manually inline function f, as shown in function v2.

Another solution is to move the offload construct into its own function as shown in function v3.

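If I understand this correctly, your best option is to move the loops into a separate function that is also marked with __attribute((target(mic))), and to call that function from inside the #pragma offload block. vec_add is then no longer called directly within the offload construct.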
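
A minimal, untested sketch of that restructuring, under a few assumptions: the function name add_arrays and the MIC_TARGET macro are hypothetical, n is taken to be a multiple of 8, and the arrays are 64-byte aligned as in the question. It also assumes that the Intel compiler predefines __INTEL_OFFLOAD when offload is enabled and __MIC__ while compiling the coprocessor side. The scalar fallback is only there so the file still builds with a compiler that has no offload support.

```cpp
#include <cassert>

// Assumption: __INTEL_OFFLOAD is defined when offload is enabled, and
// __MIC__ is defined while the coprocessor side is being compiled.
#if defined(__INTEL_OFFLOAD)
#define MIC_TARGET __attribute__((target(mic)))  // hypothetical helper macro
#else
#define MIC_TARGET
#endif

#if defined(__MIC__)
#include <immintrin.h>

// Same helper as in the question.
MIC_TARGET inline void vec_add(__m512d &v3, const __m512d &v1, const __m512d &v2) {
    v3 = _mm512_add_pd(v1, v2);
}
#endif

// Hypothetical function that holds the loop. vec_add is now called from a
// regular target(mic) function instead of directly inside #pragma offload.
MIC_TARGET void add_arrays(double *c, const double *a, const double *b, int n) {
#if defined(__MIC__)
    for (int i = 0; i < n; i += 8) {
        __m512d va = _mm512_load_pd(a + i);  // load 8 doubles from a
        __m512d vb = _mm512_load_pd(b + i);  // load 8 doubles from b
        __m512d vc;
        vec_add(vc, va, vb);                 // vc = va + vb
        _mm512_store_pd(c + i, vc);          // store 8 doubles to c
    }
#else
    // Portable fallback with the same result, used on the host side.
    assert(n % 8 == 0);
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
#endif
}

// Call site, replacing the second loop in the question's main():
//
//     #pragma offload target(mic)
//     {
//         /* allocate and fill a, b, c as before */
//         add_arrays(c, a, b, LEN);
//         /* free a, b, c as before */
//     }
```

Going by the quoted article, add_arrays itself will still not be inlined into the offload block, but that costs one call for the whole loop rather than one per iteration. Whether vec_add really gets inlined inside add_arrays is best confirmed with the -Winline report or by looking at the generated assembly.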