
To parallelize my array-based code, I am trying to figure out how to use the Intel AVX intrinsics to perform parallel operations on large arrays.

From the documentation, I have read that 256-bit AVX vectors support up to 8 parallel 32-bit integers / 32-bit floats, or up to 4 parallel 64-bit doubles. The float portion gives me no issues and works fine, but the integer AVX functions are giving me a headache. Let me use the following code to demonstrate:

The command-line option -mavx is used in conjunction with an AVX-compliant Intel processor. I will not be using AVX2 features. Compilation is done with GNU99 C on Ubuntu 16.04.

AVX FP:

#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>

int main() 
{ 
    float data[8] = {1.f,2.f,3.f,4.f,5.f,6.f,7.f,8.f};
    __m256 points = _mm256_loadu_ps(&data[0]);

    for(int i = 0; i < 8; i++)
        printf("%f\n",points[i]);

    return 0;
}

Output:

1.000000
2.000000
3.000000
4.000000
5.000000
6.000000
7.000000
8.000000

This is exactly as it should be; however, this is not the case when using the integer load AVX function:

AVX INT:

#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>

int main() 
{ 
    int data[8] = {1,2,3,4,5,6,7,8};
    __m256i points = _mm256_loadu_si256((__m256i *)&data[0]);

    for(int i = 0; i < 8; i++)
        printf("%d\n",points[i]);

    return 0;
}

Output:

1
3
5
7
1048576 [ out of bounds ]
0 [ out of bounds ]
1 [ out of bounds ]
3 [ out of bounds ]

As you can see, the load only produces 4 elements in the __m256i variable, of which only the first, third, fifth, and seventh elements of the original array appear. Beyond the fourth element, the indexing goes out of bounds.

How do I produce the desired result of loading the entire data set in order into the integer AVX data type, much like the AVX floating point data type?


1 Answer


You're using a GNU C extension to index a vector with [] instead of storing it back to an array. Intel's intrinsics documentation says nothing about this, and not all compilers support it (MSVC doesn't, for example).

GCC defines __m256i as a GNU C native vector of long long. <immintrin.h> doesn't define different __m256i types for SIMD vectors of int or short, and __m256i doesn't remember anything about where it came from / how it was set. (Unlike the FP vectors, where there are separate C types for ps and pd, so you have to use _mm_castps_pd(__m128) to get a __m128d if you want to use shufpd or unpcklpd on a ps vector.)

You can typedef native vector types like v8si yourself (see the previous link to gcc docs), or use a library like Agner Fog's VCL that gives you types like Vec8i (8 signed int) or Vec32uc (32 unsigned char). They have operator overloads that let you write a + b instead of _mm256_add_epi32(a, b) or _mm256_add_epi8(a,b) depending on type. Or use [] instead of _mm_extract_epi32 / epi8 / epi16 / epi64.

See print a __m128i variable for portable and safe/correct ways to loop over / print out the elements of an Intel intrinsic SIMD variable. TL;DR: _mm_store / _mm256_store to a tmp array and index that. It's portable, and it optimizes away (to a pextrd for integer or just a shuffle for FP); there's no actual store/reload in simple cases.