
I'm writing code using the C intrinsics for Intel's AVX instructions. If I have a packed double vector (a __m256d), what is the most efficient way (i.e. the fewest operations) to store each of its four elements to a different place in memory? In other words, I need to fan the elements out to separate locations so that they are no longer packed. Pseudocode:

__m256d src;   // four packed doubles; src[i] below means element i
double *dst;
int dst_dist;  // distance between destinations, in doubles
dst[0] = src[0];
dst[dst_dist] = src[1];
dst[2 * dst_dist] = src[2];
dst[3 * dst_dist] = src[3];

Using SSE, I could do this with __m128 types using the _mm_storel_pi and _mm_storeh_pi intrinsics. I've not been able to find anything similar for AVX that allows me to store the individual 64-bit pieces to memory. Does one exist?
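
For reference, a minimal sketch of the SSE pattern described above. The helper name store_halves_ps and its two destination pointers are hypothetical, chosen only for illustration; each store writes one 64-bit half (two floats) of the vector.

```c
#include <xmmintrin.h>

/* Hypothetical helper: store the two 64-bit halves of an SSE float
   vector to two unrelated destinations. */
static void store_halves_ps(float *lo_dst, float *hi_dst, __m128 v)
{
    _mm_storel_pi((__m64 *)lo_dst, v);  /* elements 0 and 1 */
    _mm_storeh_pi((__m64 *)hi_dst, v);  /* elements 2 and 3 */
}
```

Neither store requires the destination to be 16-byte aligned, which is what makes this convenient for fanning out.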

I think you wanted to use __m256d. __m256 is 8 floats. - Norbert P.
Thanks, I fixed that. I missed it because I don't use __m256d; I'm actually using floats. The doubles that I want to extract and store are actually complex numbers (two floats, or the size of one double). - Jason R

1 Answer


You can do it with a couple of extract intrinsics (warning: untested):

__m256d src = ...;  // data

__m128d a = _mm256_extractf128_pd(src, 0);
__m128d b = _mm256_extractf128_pd(src, 1);

_mm_storel_pd(dst + 0*dst_dist, a);
_mm_storeh_pd(dst + 1*dst_dist, a);
_mm_storel_pd(dst + 2*dst_dist, b);
_mm_storeh_pd(dst + 3*dst_dist, b);
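
The same sequence wrapped in a self-contained function, as a sketch rather than a tuned implementation. The name store_strided4_pd and the choice to load the vector from a plain array are assumptions made so the example can be compiled and called on its own; in real code the __m256d would come from your computation. The target attribute is a GCC/Clang feature that lets this one function use AVX without building the whole file with -mavx.

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical helper: writes four doubles to dst[0], dst[dst_dist],
   dst[2*dst_dist] and dst[3*dst_dist]. Remove the attribute if the
   file is already compiled with -mavx. */
__attribute__((target("avx")))
static void store_strided4_pd(double *dst, ptrdiff_t dst_dist,
                              const double *src)
{
    /* Stand-in for the vector you already have in a register. */
    __m256d v = _mm256_loadu_pd(src);

    /* Split the 256-bit vector into its two 128-bit halves. */
    __m128d lo = _mm256_extractf128_pd(v, 0);  /* elements 0 and 1 */
    __m128d hi = _mm256_extractf128_pd(v, 1);  /* elements 2 and 3 */

    /* Store each 64-bit element to its own destination. */
    _mm_storel_pd(dst + 0 * dst_dist, lo);
    _mm_storeh_pd(dst + 1 * dst_dist, lo);
    _mm_storel_pd(dst + 2 * dst_dist, hi);
    _mm_storeh_pd(dst + 3 * dst_dist, hi);
}
```

The low half can also be obtained with _mm256_castpd256_pd128(v), which is only a reinterpretation of the register and generates no instruction, so only the upper half needs a real extract.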

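What you really want is a scatter instruction. AVX2 adds gather instructions (indexed loads), but it has no scatter stores; those only arrive with AVX-512. So on AVX and AVX2 hardware, the extract-and-store sequence above is the practical option.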