SSE _mm_load_pd works while _mm_store_pd segfaults

Question

I am trying to learn the ropes of SSE intrinsics in C. I have a piece of code where I load a two-component vector of double data, add something to it and then attempt to store it back to memory. Everything works: I can load my data into SEE registers, I can operate on my data in those SSE registers, but the moment I attempt to write that processed data back to the original array (which is where I read my data from in the first place!) I get a segmentation fault.

Can anyone please advice me on this issue -- this is driving me insane.

double res[2] __attribute__((aligned(16)));

for(int k=0; k<n; k++){
int i=0;
for(; i+1<n; i+=2)
  {
    __m128d cik = _mm_load_pd(&C[i+k*n]);
    int j = 0;
    for(; j+1<n; j+=2)
      {
        __m128d aTij = _mm_load_pd(&A_T[j+i*n]);
        __m128d bjk = _mm_load_pd(&B[j+k*n]);
        __m128d dotpr = _mm_dp_pd(aTij, bjk,2);
        cik = _mm_add_pd(cik, dotpr);
      }
    _mm_store_pd(res, cik);
    //C[i+k*n] = res[0];
  }
}

As I say above, everything works in this code except for where I store my results back to that one-dimensional array "C" where I read my data from in the first place. That is, when I remove the comment signs in front of

//C[i+k*n] = res[0];

I get a segmentation fault.

How is it possible that I can read from C with the aligned memory version of _mm_load_pd (so C must be aligned in memory!) while writing back to it doesn't work? "C" must be aligned, and as you can see "res" must also be aligned.

Disclaimer: My original code read

_mm_store_pd(&C[i+k*n], cik);

which also produced a segmentation fault and I started introducing "res" with explicit alignment in my attempt to solve the problem.

Addendum

A, B, C are declared as follows:

buf = (double*) malloc (3 * nmax * nmax * sizeof(double));
double* A = buf + 0;
double* B = A + nmax*nmax;
double* C = B + nmax*nmax;

Attempted Solution with posix_memalign

In attempt to solve the segmentation fault issue when writing to the original one-dimensional array, I now use buffers for the corresponding matrices. However, this still segfauls when attempting to write back to C_buff!

double res[2] __attribute__((aligned(16)));

double * A_T;
posix_memalign((void**)&A_T, 16, n*n*sizeof(double));

double * B_buff;
posix_memalign((void**)&B_buff, 16, n*n*sizeof(double));

double * C_buff;
posix_memalign((void**)&C_buff, 16, n*n*sizeof(double));

for(int y=0; y<n; y++)
  for(int x=0; x<n; x++)
    A_T[x+y*n] = A[y+x*n];

for(int x=0; x<n; x++)
  for(int y=0; y<n; y++)
    B_buff[y+x*n] = B[y+x*n];

for(int x=0; x<n; x++)
  for(int y=0; y<n; y++)
    C_buff[y+x*n] = C[y+x*n];

for(int k=0; k<n; k++){
  int i=0;
  for(; i+1<n; i+=2)
    {
      __m128d cik = _mm_load_pd(&C_buff[i+k*n]);
      int j = 0;
      for(; j+1<n; j+=2)
        {
          __m128d aTij = _mm_load_pd(&A_T[j+i*n]);
          __m128d bjk = _mm_load_pd(&B_buff[j+k*n]);
          __m128d dotpr = _mm_dp_pd(aTij, bjk,2);
          cik = _mm_add_pd(cik, dotpr);
        }
      _mm_store_pd(&C_buff[i+k*n], cik);

  //_mm_store_pd(res, cik);
      //C_buff[i+k*n] = res[0];
  //C_buff[i+1+k*n] = res[1];
    }
}

@TonyTheLion please see addendum in question. As far as I understand, malloc attempts to align the chunk of memory it allocates but doesn't always succeed for all purposes. My main point of confusion about the above is that I can read from that particular location in "C" but can't write to it. So "C" appears aligned for the purpose of reading but not writing? — user2042696
I think that going by the assumption that malloc will align anything is iffy, you may want to use aligned_alloc if you're using GCC or _aligned_malloc if you're using MSVC. — Tony The Lion
@TonyTheLion thank you for your suggestion. Since I can't do anything about how A, B, C are allocated initially, I now copy their values to one-dimensional arrays that I allocate with aligned_malloc as you suggested. However, I still get a segmentation fault when attempting to write back to the buffer of C, C_buff. — user2042696
I mean if the code works with _mm_loadu_pd and _mm_storeu_pd then at least you know where the problem is. — user2088790

Salah Saleh Salah Saleh · Accepted Answer · 2015-03-16T15:49:21

When you remove the _mm_store_pd(&C_buff[i+k*n], cik); the whole loop is optimized and removed. The compiler deduce that the whole for loop doesn't lead to any meaningful work and remove it. That's why you don't get segmentation fault anymore.
I am sure the segmentation fault is because of the size of the array. Consider this simple program based on your example:

#include <stdio.h>
#include "emmintrin.h"

int main(){
int n = 15;
int y,x,k,i,j;

double * A;
posix_memalign((void**)&A, 16, n*n*sizeof(double));

double * B;
posix_memalign((void**)&B, 16, n*n*sizeof(double));

double * C;
posix_memalign((void**)&C, 16, n*n*sizeof(double));

for(y=0; y<n; y++)
  for(x=0; x<n; x++)
    A[x+y*n] = 0.1;

for(x=0; x<n; x++)
  for(y=0; y<n; y++)
    B[y+x*n] = 0.1;

for(x=0; x<n; x++)
  for( y=0; y<n; y++)
    C[y+x*n] = 0.1;

for( k=0; k<n; k++){
   i=0;
  for(; i+1<n; i+=2)
    {
      __m128d cik = _mm_load_pd(&C[i+k*n]);
       j = 0;
      for(; j+1<n; j+=2)
        {
          __m128d aTij = _mm_load_pd(&A[j+i*n]);
          __m128d bjk = _mm_load_pd(&B[j+k*n]);
          __m128d dotpr = _mm_add_pd(aTij, bjk);
          cik = _mm_add_pd(cik, dotpr);
        }
      _mm_store_pd(&C[i+k*n], cik);
    }
}
printf("C[15]: %f\n", C[15]);
printf("C[14]: %f\n", C[14]);

This gives a segmentation fault because n is odd. Now change the n=15 to n=16 and everything will run as expected. Hence, padding your arrays to an even number (or even better, to a size of a cache line -> 64 byte == 8 DP elements or 16 SP elements) will prevent such problems, and will lead to a better performance.

SSE _mm_load_pd works while _mm_store_pd segfaults

3 Answers