I am trying to learn the ropes of SSE intrinsics in C. I have a piece of code where I load a two-component vector of double data, add something to it, and then attempt to store it back to memory. Everything works: I can load my data into SSE registers and operate on it there, but the moment I attempt to write the processed data back to the original array (which is where I read the data from in the first place!) I get a segmentation fault.
Can anyone please advise me on this issue? This is driving me insane.
    double res[2] __attribute__((aligned(16)));
    for(int k=0; k<n; k++){
        int i=0;
        for(; i+1<n; i+=2)
        {
            __m128d cik = _mm_load_pd(&C[i+k*n]);
            int j = 0;
            for(; j+1<n; j+=2)
            {
                __m128d aTij = _mm_load_pd(&A_T[j+i*n]);
                __m128d bjk = _mm_load_pd(&B[j+k*n]);
                __m128d dotpr = _mm_dp_pd(aTij, bjk,2);
                cik = _mm_add_pd(cik, dotpr);
            }
            _mm_store_pd(res, cik);
            //C[i+k*n] = res[0];
        }
    }
As I said above, everything in this code works except storing the results back to the one-dimensional array C that I read the data from in the first place. That is, when I uncomment the line
    //C[i+k*n] = res[0];
I get a segmentation fault.
How is it possible that I can read from C with the aligned load _mm_load_pd (so C must be aligned in memory!) while writing back to it doesn't work? C must be aligned, and as you can see, res must also be aligned.
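The alignment assumption in this question can also be tested directly rather than inferred from a load not crashing. A minimal sketch, assuming the 16-byte requirement of _mm_load_pd and _mm_store_pd; the helper name is_aligned is hypothetical and not part of the code above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if address p is a multiple of
   `alignment` bytes, 0 otherwise. `alignment` must be a power of two. */
static int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

Calling is_aligned(&C[i+k*n], 16) just before the load and the store would show whether a given address really sits on a 16-byte boundary. Since consecutive doubles are 8 bytes apart, at most every second element of an array can pass this check.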
Disclaimer: My original code read
    _mm_store_pd(&C[i+k*n], cik);
which also produced a segmentation fault, so I introduced res with explicit alignment in an attempt to solve the problem.
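Because a double is 8 bytes, &C[idx] can only be 16-byte aligned when C itself is aligned and idx is even. A small sketch of that arithmetic for the index i + k*n used in the store, assuming a 16-byte-aligned base pointer and 8-byte doubles; the function name index_is_aligned16 is hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: for a 16-byte-aligned array of 8-byte doubles,
   returns 1 if element i + k*n also starts on a 16-byte boundary. */
static int index_is_aligned16(size_t n, size_t k, size_t i)
{
    assert(sizeof(double) == 8);      /* assumption stated in the text */
    size_t idx = i + k * n;           /* same indexing as C[i+k*n] */
    return (idx * sizeof(double)) % 16 == 0;
}
```

With even n, every (k, i) pair with even i passes. With odd n, for example n = 3, the pair k = 1, i = 0 gives index 3 and the check fails, so an aligned load or store at that address would be invalid.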
Addendum
A, B, C are declared as follows:
    buf = (double*) malloc (3 * nmax * nmax * sizeof(double));
    double* A = buf + 0;
    double* B = A + nmax*nmax;
    double* C = B + nmax*nmax;
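This layout places B and C at element offsets nmax*nmax and 2*nmax*nmax from buf. A rough sketch of what that implies for 16-byte alignment, assuming 8-byte doubles and assuming buf itself is 16-byte aligned (malloc only promises alignment suitable for the fundamental types, which is not necessarily 16 bytes on every platform); the function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if an offset of `elems` doubles from a 16-byte-aligned base
   is itself a multiple of 16 bytes. */
static int offset_is_aligned16(size_t elems)
{
    assert(sizeof(double) == 8);      /* assumption stated in the text */
    return (elems * sizeof(double)) % 16 == 0;
}

/* Element offsets of B and C from buf, mirroring the declarations above. */
static size_t offset_of_B(size_t nmax) { return nmax * nmax; }
static size_t offset_of_C(size_t nmax) { return 2 * nmax * nmax; }
```

For even nmax both offsets are multiples of 16 bytes. For odd nmax the offset of B is an odd number of doubles, so B would start 8 bytes past a 16-byte boundary, while the offset of C is always even.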
Attempted Solution with posix_memalign
In an attempt to solve the segmentation fault when writing to the original one-dimensional array, I now copy the matrices into separately allocated buffers. However, this still segfaults when attempting to write back to C_buff!
    double res[2] __attribute__((aligned(16)));
    double * A_T;
    posix_memalign((void**)&A_T, 16, n*n*sizeof(double));
    double * B_buff;
    posix_memalign((void**)&B_buff, 16, n*n*sizeof(double));
    double * C_buff;
    posix_memalign((void**)&C_buff, 16, n*n*sizeof(double));
    for(int y=0; y<n; y++)
        for(int x=0; x<n; x++)
            A_T[x+y*n] = A[y+x*n];
    for(int x=0; x<n; x++)
        for(int y=0; y<n; y++)
            B_buff[y+x*n] = B[y+x*n];
    for(int x=0; x<n; x++)
        for(int y=0; y<n; y++)
            C_buff[y+x*n] = C[y+x*n];
    for(int k=0; k<n; k++){
        int i=0;
        for(; i+1<n; i+=2)
        {
            __m128d cik = _mm_load_pd(&C_buff[i+k*n]);
            int j = 0;
            for(; j+1<n; j+=2)
            {
                __m128d aTij = _mm_load_pd(&A_T[j+i*n]);
                __m128d bjk = _mm_load_pd(&B_buff[j+k*n]);
                __m128d dotpr = _mm_dp_pd(aTij, bjk,2);
                cik = _mm_add_pd(cik, dotpr);
            }
            _mm_store_pd(&C_buff[i+k*n], cik);
            //_mm_store_pd(res, cik);
            //C_buff[i+k*n] = res[0];
            //C_buff[i+1+k*n] = res[1];
        }
    }
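The posix_memalign calls in this attempt ignore the return value. A hedged sketch of a checked allocation, assuming a POSIX system; the names alloc_doubles_aligned16 and aligned16_allocation_works are hypothetical. posix_memalign returns 0 on success and an error number otherwise, and the memory is released with free:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* exposes posix_memalign in strict modes */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical wrapper: allocates `count` doubles on a 16-byte boundary.
   Returns NULL if the allocation fails. */
static double *alloc_doubles_aligned16(size_t count)
{
    assert(count > 0);
    void *p = NULL;
    if (posix_memalign(&p, 16, count * sizeof(double)) != 0)
        return NULL;
    return (double *)p;
}

/* Allocates, checks the returned address, and frees again.
   Returns 1 if the block was allocated and is 16-byte aligned. */
static int aligned16_allocation_works(size_t count)
{
    double *p = alloc_doubles_aligned16(count);
    if (p == NULL)
        return 0;
    int ok = ((uintptr_t)p % 16) == 0;
    free(p);
    return ok;
}
```

An aligned base pointer only guarantees the alignment of element 0. Elements at odd indices are still 8 bytes off a 16-byte boundary, whichever allocator produced the buffer.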
malloc will align anything is iffy, you may want to use aligned_alloc if you're using GCC or _aligned_malloc if you're using MSVC. - Tony The Lion
A, B, C are allocated initially, I now copy their values to one-dimensional arrays that I allocate with aligned_malloc as you suggested. However, I still get a segmentation fault when attempting to write back to the buffer of C, C_buff. - user2042696