I am trying to understand the gather functionality of the AVX2 Intel intrinsics.
According to the official documentation (Link), the function definition is:
__m256i _mm256_i32gather_epi32 (int const* base_addr, __m256i vindex, const int scale)
Gather 32-bit integers from memory using 32-bit indices. 32-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.
Therefore, as per my understanding, it returns a __m256i vector filled with 8 integers from the array starting at base_addr, taken at the 8 indices stored in vindex. If a scale is given, each index is also multiplied by it. To test my understanding, I wrote this code:
#include <stdio.h>
#include <stdlib.h>   /* aligned_alloc */
#include <immintrin.h>

int main()
{
    __m256i var, ind_intel;
    /* 32-byte aligned buffers so the aligned load/store below are valid */
    int * arr = (int *) aligned_alloc(sizeof(__m256i), sizeof(int) * 64);
    int * out = (int *) aligned_alloc(sizeof(__m256i), sizeof(int) * 8);
    int * ind = (int *) aligned_alloc(sizeof(__m256i), sizeof(int) * 8);
    int i;

    /* indices 0, 2, ..., 14 */
    ind[0] = 0; ind[1] = 2; ind[2] = 4; ind[3] = 6;
    ind[4] = 8; ind[5] = 10; ind[6] = 12; ind[7] = 14;
    ind_intel = _mm256_load_si256((__m256i *)&ind[0]);

    /* source array 0, 1, ..., 63 */
    for (i = 0; i < 64; i++)
        arr[i] = i;

    var = _mm256_i32gather_epi32(arr, ind_intel, 1);
    _mm256_store_si256((__m256i *)&out[0], var);

    for (i = 0; i < 8; i++)
        printf("%d ", out[i]);
    return 0;
}
Now, the __m256i variable ind_intel gets the indices 0, 2, ..., 14. The main array arr is loaded with 0, 1, ..., 63. Therefore, the gather should load arr[0], arr[2], ..., arr[14]. But it prints:
0 65536 1 131072 2 196608 3 262144
I am definitely missing something big, but I could not find any website or document that clearly explains how to use gather. Each of them repeats the same description as the official documentation. Can anyone explain what is wrong with the code and with my understanding?
N.B. The code is just for testing purposes.
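For reference, the description quoted above can be modeled as a scalar loop. This is only a minimal sketch, assuming that scale multiplies each index to form a byte offset from base_addr (my reading of the pseudocode in Intel's documentation); emulate_gather_i32 is a hypothetical helper written for illustration, not an Intel function.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical scalar model of _mm256_i32gather_epi32 (illustration only).
   Assumption: each index is multiplied by `scale` to get a BYTE offset
   from base_addr, and 4 bytes are read from that address into each lane. */
static void emulate_gather_i32(const int *base_addr, const int idx[8],
                               int scale, int dst[8])
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);

    const unsigned char *bytes = (const unsigned char *)base_addr;
    for (int lane = 0; lane < 8; lane++) {
        /* byte offset, not element offset */
        ptrdiff_t byte_offset = (ptrdiff_t)idx[lane] * scale;
        /* memcpy allows the read to start at an unaligned byte address */
        memcpy(&dst[lane], bytes + byte_offset, sizeof(int));
    }
}
```

Under that assumption, calling this helper with the arr and ind contents from the test program and scale 1 reads 4 bytes at byte offsets 0, 2, 4, ..., 14, so every second result straddles two neighbouring ints; on a little-endian machine that would give the 65536, 131072, ... values seen in the output. With scale 4 (that is, sizeof(int)) the offsets land on arr[0], arr[2], ..., arr[14].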