9
votes

I am trying to learn the ropes of SSE intrinsics in C. I have a piece of code where I load a two-component vector of double data, add something to it and then attempt to store it back to memory. Everything works: I can load my data into SEE registers, I can operate on my data in those SSE registers, but the moment I attempt to write that processed data back to the original array (which is where I read my data from in the first place!) I get a segmentation fault.

Can anyone please advice me on this issue -- this is driving me insane.

double res[2] __attribute__((aligned(16)));

for(int k=0; k<n; k++){
int i=0;
for(; i+1<n; i+=2)
  {
    __m128d cik = _mm_load_pd(&C[i+k*n]);
    int j = 0;
    for(; j+1<n; j+=2)
      {
        __m128d aTij = _mm_load_pd(&A_T[j+i*n]);
        __m128d bjk = _mm_load_pd(&B[j+k*n]);
        __m128d dotpr = _mm_dp_pd(aTij, bjk,2);
        cik = _mm_add_pd(cik, dotpr);
      }
    _mm_store_pd(res, cik);
    //C[i+k*n] = res[0];
  }
}

As I say above, everything works in this code except for where I store my results back to that one-dimensional array "C" where I read my data from in the first place. That is, when I remove the comment signs in front of

//C[i+k*n] = res[0];

I get a segmentation fault.

How is it possible that I can read from C with the aligned memory version of _mm_load_pd (so C must be aligned in memory!) while writing back to it doesn't work? "C" must be aligned, and as you can see "res" must also be aligned.

Disclaimer: My original code read

_mm_store_pd(&C[i+k*n], cik);

which also produced a segmentation fault and I started introducing "res" with explicit alignment in my attempt to solve the problem.

Addendum

A, B, C are declared as follows:

buf = (double*) malloc (3 * nmax * nmax * sizeof(double));
double* A = buf + 0;
double* B = A + nmax*nmax;
double* C = B + nmax*nmax;

Attempted Solution with posix_memalign

In attempt to solve the segmentation fault issue when writing to the original one-dimensional array, I now use buffers for the corresponding matrices. However, this still segfauls when attempting to write back to C_buff!

double res[2] __attribute__((aligned(16)));

double * A_T;
posix_memalign((void**)&A_T, 16, n*n*sizeof(double));

double * B_buff;
posix_memalign((void**)&B_buff, 16, n*n*sizeof(double));

double * C_buff;
posix_memalign((void**)&C_buff, 16, n*n*sizeof(double));

for(int y=0; y<n; y++)
  for(int x=0; x<n; x++)
    A_T[x+y*n] = A[y+x*n];

for(int x=0; x<n; x++)
  for(int y=0; y<n; y++)
    B_buff[y+x*n] = B[y+x*n];

for(int x=0; x<n; x++)
  for(int y=0; y<n; y++)
    C_buff[y+x*n] = C[y+x*n];

for(int k=0; k<n; k++){
  int i=0;
  for(; i+1<n; i+=2)
    {
      __m128d cik = _mm_load_pd(&C_buff[i+k*n]);
      int j = 0;
      for(; j+1<n; j+=2)
        {
          __m128d aTij = _mm_load_pd(&A_T[j+i*n]);
          __m128d bjk = _mm_load_pd(&B_buff[j+k*n]);
          __m128d dotpr = _mm_dp_pd(aTij, bjk,2);
          cik = _mm_add_pd(cik, dotpr);
        }
      _mm_store_pd(&C_buff[i+k*n], cik);

  //_mm_store_pd(res, cik);
      //C_buff[i+k*n] = res[0];
  //C_buff[i+1+k*n] = res[1];
    }
}
3
@TonyTheLion please see addendum in question. As far as I understand, malloc attempts to align the chunk of memory it allocates but doesn't always succeed for all purposes. My main point of confusion about the above is that I can read from that particular location in "C" but can't write to it. So "C" appears aligned for the purpose of reading but not writing? - user2042696
I think that going by the assumption that malloc will align anything is iffy, you may want to use aligned_alloc if you're using GCC or _aligned_malloc if you're using MSVC. - Tony The Lion
@TonyTheLion thank you for your suggestion. Since I can't do anything about how A, B, C are allocated initially, I now copy their values to one-dimensional arrays that I allocate with aligned_malloc as you suggested. However, I still get a segmentation fault when attempting to write back to the buffer of C, C_buff. - user2042696
are you writing within the bounds of the array? - Tony The Lion
I mean if the code works with _mm_loadu_pd and _mm_storeu_pd then at least you know where the problem is. - user2088790

3 Answers

1
votes

When you remove the _mm_store_pd(&C_buff[i+k*n], cik); the whole loop is optimized and removed. The compiler deduce that the whole for loop doesn't lead to any meaningful work and remove it. That's why you don't get segmentation fault anymore.
I am sure the segmentation fault is because of the size of the array. Consider this simple program based on your example:

#include <stdio.h>
#include "emmintrin.h"

int main(){
int n = 15;
int y,x,k,i,j;

double * A;
posix_memalign((void**)&A, 16, n*n*sizeof(double));

double * B;
posix_memalign((void**)&B, 16, n*n*sizeof(double));

double * C;
posix_memalign((void**)&C, 16, n*n*sizeof(double));

for(y=0; y<n; y++)
  for(x=0; x<n; x++)
    A[x+y*n] = 0.1;

for(x=0; x<n; x++)
  for(y=0; y<n; y++)
    B[y+x*n] = 0.1;

for(x=0; x<n; x++)
  for( y=0; y<n; y++)
    C[y+x*n] = 0.1;

for( k=0; k<n; k++){
   i=0;
  for(; i+1<n; i+=2)
    {
      __m128d cik = _mm_load_pd(&C[i+k*n]);
       j = 0;
      for(; j+1<n; j+=2)
        {
          __m128d aTij = _mm_load_pd(&A[j+i*n]);
          __m128d bjk = _mm_load_pd(&B[j+k*n]);
          __m128d dotpr = _mm_add_pd(aTij, bjk);
          cik = _mm_add_pd(cik, dotpr);
        }
      _mm_store_pd(&C[i+k*n], cik);
    }
}
printf("C[15]: %f\n", C[15]);
printf("C[14]: %f\n", C[14]);

This gives a segmentation fault because n is odd. Now change the n=15 to n=16 and everything will run as expected. Hence, padding your arrays to an even number (or even better, to a size of a cache line -> 64 byte == 8 DP elements or 16 SP elements) will prevent such problems, and will lead to a better performance.

0
votes

Even with the __attribute__((aligned(32))) , I was getting the same error(it had %50 chance of mis-alingment). Then I used the following function to get a %100 chance of alignment(a should be power of two):

void * malloc_float_align(size_t n, unsigned int a/*alignment*/, float *& output)
{
    void * adres=NULL;
    void * adres2=NULL;
    adres=malloc(n*sizeof(float)+a);
    size_t adr=(size_t)adres;
    size_t adr2=adr+a-(adr&(a-1u)); // a valid address for a alignment
    adres2=(void * ) adr2;
    output=(float *)adres2;
    return adres;                //pointer to be used in free()
}

Then usage in main:

int main()
{


  float * res=NULL;
  void * origin=malloc_float_align(1024,32u,res);
  //use res for sse/avx
  free(origin); // actual allocation is more than 1024 elements
  return 0;
}

Ofcourse this is in c++ so you can make it work with only changing some function parameter style.

0
votes

A simple trick would be to perform an ASSERT and see if it triggers:

ASSERT( ((size_t)(&C_buff[i+k*n]) & 0xF) == 0);

The ASSERT will fire when the address isn't SSE aligned. 64 bit builds should provide 16B alignment by default. If you plan to have 32bit code, use one of the above align_malloc functions. You'll need to use the relevant align_free or you'll crash.