To parallelize my array-based code, I am trying to figure out how to use the Intel AVX intrinsic functions to operate on large arrays in parallel.
From the documentation, I understand that a 256-bit AVX vector holds up to eight 32-bit integers or 32-bit floats, or up to four 64-bit doubles. I compile with the command-line option -mavx and run on an AVX-capable Intel processor. I am not using any AVX2 features. The code is compiled as GNU99 C on Ubuntu 16.04.
The float version works fine, but the integer AVX functions are giving me a headache. The following code demonstrates the problem:
AVX FP:
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>
int main()
{
    float data[8] = {1.f,2.f,3.f,4.f,5.f,6.f,7.f,8.f};
    __m256 points = _mm256_loadu_ps(&data[0]);
    for(int i = 0; i < 8; i++)
        printf("%f\n",points[i]);
    return 0;
}
Output:
1.000000
2.000000
3.000000
4.000000
5.000000
6.000000
7.000000
8.000000
This is exactly as it should be. However, that is not the case with the integer AVX load function:
AVX INT:
#include <stdio.h>
#include <stdlib.h>
#include <immintrin.h>
int main()
{
    int data[8] = {1,2,3,4,5,6,7,8};
    __m256i points = _mm256_loadu_si256((__m256i *)&data[0]);
    for(int i = 0; i < 8; i++)
        printf("%d\n",points[i]);
    return 0;
}
Output:
1
3
5
7
1048576 [ out of bounds ]
0 [ out of bounds ]
1 [ out of bounds ]
3 [ out of bounds ]
As you can see, the load appears to produce only 4 elements in the __m256i variable, and only the first, third, fifth and seventh elements of the original array show up. Beyond the fourth element, the index goes out of bounds.
How do I load the entire data set, in order, into the integer AVX data type, as I can with the AVX floating-point data type?