1 vote

I was wondering what the best way is to store a 256-bit AVX vector into 4 64-bit unsigned long integers. Going by the intrinsics listed at https://software.intel.com/sites/landingpage/IntrinsicsGuide/ I could only figure out how to do this with a masked store (code below). But is that the best way, or are there other methods?

#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h> //rand()

int main() {

    unsigned long long int i,j;
    unsigned long long int bit[32][4];//256 bit random numbers
    unsigned long long int bit_out[32][4];//256 bit random numbers for test

    for(i=0;i<32;i++){ //load with 64 bit random integers
        for(j=0;j<4;j++){
            bit[i][j]=rand();
            bit[i][j]=bit[i][j]<<32 | rand();
        }
    }

//--------------------load masking-------------------------
    __m256i v_bit[32];
    __m256i mask;
    unsigned long long int mask_ar[4];
    mask_ar[0]=~0ULL;mask_ar[1]=~0ULL;mask_ar[2]=~0ULL;mask_ar[3]=~0ULL; //~0UL is only 32 bits wide on some platforms
    mask = _mm256_loadu_si256 ((__m256i const *)mask_ar);
//--------------------load masking ends-------------------------

//--------------------------load the vectors-------------------
    for(i=0;i<32;i++){

        v_bit[i]=_mm256_loadu_si256 ((__m256i const *)bit[i]);

    }
//--------------------------load the vectors ends-------------------

//--------------------------extract from the vectors-------------------
    for(i=0;i<32;i++){

        _mm256_maskstore_epi64 ((long long *)bit_out[i], mask, v_bit[i]); //intrinsic takes a long long (__int64) pointer
    }
//--------------------------extract from the vectors end-------------------

    for(i=0;i<32;i++){ //verify that the stored values match the originals
        for(j=0;j<4;j++){
            if(bit[i][j]!=bit_out[i][j])
                printf("----ERROR----\n");
        }
    }

  return 0;
}
Best way is not to. unsigned long is not guaranteed to have 64 bits. If you need a specific bit width (and encoding), use fixed-width types from stdint.h. – too honest for this site
Maybe you should have a look at the extract, set and insert intrinsics. I have no idea what you are trying to do. – Christoph Diegelmann
@Christoph I just want to extract a 256-bit vector into 4 64-bit integers. I didn't find the intrinsics you mentioned on the page linked above. – Rick
If the destination 64-bit ints are contiguous then just use _mm256_storeu_si256. – Paul R
In C11, use _Alignas(32) unsigned long long int bit[32][4]; to get the compiler to align the stack memory for your array. This helps with performance even if you still use _mm256_storeu_ps. – Peter Cordes

1 Answer

1 vote

As others said in the comments, you do not need to use a masked store in this case. The following loop produces no errors in your program:

for(i=0;i<32;i++){
    _mm256_storeu_si256 ((__m256i *) bit_out[i], v_bit[i]); //destination pointer must not be const-qualified
}

So the instruction you are looking for is _mm256_storeu_si256: it stores a __m256i vector to an unaligned address. If your data is aligned, you can use _mm256_store_si256 instead. To see your vector's values you can use this function:

#include <stdalign.h>
alignas(32) unsigned long long int tempu64[4];
void printVecu64(__m256i vec)
{
    _mm256_store_si256((__m256i *)&tempu64[0], vec);
    printf("[0]=%llu, [1]=%llu, [2]=%llu, [3]=%llu\n\n",
           tempu64[0], tempu64[1], tempu64[2], tempu64[3]); //%llu for unsigned long long
}

The _mm256_maskstore_epi64 intrinsic lets you choose which elements are stored to memory. It is useful when you want to store only some elements of a vector and leave the memory locations of the other elements unchanged.

I was reading the Intel 64 and IA-32 Architectures Optimization Reference Manual (248966-032), 2016, page 410, and interestingly found out that an unaligned store can still be a performance killer.

11.6.3 Prefer Aligned Stores Over Aligned Loads

There are cases where it is possible to align only a subset of the processed data buffers. In these cases, aligning data buffers used for store operations usually yields better performance than aligning data buffers used for load operations. Unaligned stores are likely to cause greater performance degradation than unaligned loads, since there is a very high penalty on stores to a split cache-line that crosses pages. This penalty is estimated at 150 cycles. Loads that cross a page boundary are executed at retirement. In Example 11-12, unaligned store address can affect SAXPY performance for 3 unaligned addresses to about one quarter of the aligned case.

I shared this here because some people said there is no difference between aligned and unaligned stores except for debugging!