3
votes

I'm trying to use shared memory to cache things with OpenACC.

Basically what I'm working on is a matrix multiplication, and what I have is this:

typedef float ff; 

// Multiplies two square column-major matrices a and b, puts the result in c.
// (Element (i, k) of a is stored at a[i + n * k].)
void mmul(const ff* restrict a, 
          const ff* restrict b, 
          ff* restrict c, 
          const int n) { 
#pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n]) 
{ 

#pragma acc region 
{ 

#pragma acc loop independent vector(16) 
  for (int i = 0; i < n; ++i) { 
#pragma acc loop independent vector(16) 
    for (int j = 0; j < n; ++j) { 
      ff sum = 0; 
      for (int k = 0; k < n; ++k) { 
        sum += a[i + n * k] * b[k + n * j]; 
      } 
      c[i + n * j] = sum; 
    } 
  } 

} 
}
}

What I would like to do is use shared memory to cache tiles of the matrices 'a' and 'b' to use in the computation of 'c', in a similar fashion to what the CUDA mmul algorithm does.

Basically, in CUDA I would know the exact size of my blocks, and would be able to:

  • declare a shared memory with the size of the block
  • copy the 'relevant' part of the data to the block
  • use this data
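Those three steps have a direct serial-C analogue, which may help when thinking about what the compiler has to generate. This is only a sketch: `TILE` stands in for the CUDA block size, the local arrays stand in for `__shared__` memory, `mmul_tiled` is a name I made up, and `n` is assumed to be a multiple of `TILE` to keep the code short.

```c
#define TILE 16

typedef float ff;

/* Serial analogue of the CUDA shared-memory scheme: for each TILE x TILE
   block of c, stage the matching tiles of a and b in small local arrays
   (playing the role of shared memory), then accumulate from those copies.
   Matrices are column-major, as in the question: element (i, k) of a is
   stored at a[i + n * k]. */
void mmul_tiled(const ff* restrict a, const ff* restrict b,
                ff* restrict c, const int n) {
  for (int bi = 0; bi < n; bi += TILE)
    for (int bj = 0; bj < n; bj += TILE) {
      ff acc[TILE][TILE] = {{0}};
      for (int bk = 0; bk < n; bk += TILE) {
        ff ta[TILE][TILE], tb[TILE][TILE];   /* step 1: the "shared" tiles */
        for (int i = 0; i < TILE; ++i)
          for (int k = 0; k < TILE; ++k) {   /* step 2: copy the relevant data */
            ta[i][k] = a[(bi + i) + n * (bk + k)];
            tb[i][k] = b[(bk + i) + n * (bj + k)];
          }
        for (int i = 0; i < TILE; ++i)       /* step 3: use this data */
          for (int j = 0; j < TILE; ++j)
            for (int k = 0; k < TILE; ++k)
              acc[i][j] += ta[i][k] * tb[k][j];
      }
      for (int i = 0; i < TILE; ++i)
        for (int j = 0; j < TILE; ++j)
          c[(bi + i) + n * (bj + j)] = acc[i][j];
    }
}
```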

I understand I can use the

#pragma acc cache

directive (the OpenACC spec spells it cache, not cached), and that I can specify block sizes with the vector and gang options, but I'm having some trouble understanding how that's going to be mapped to the CUDA architecture.
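For reference, here is one way the cache directive might be spelled out on this loop nest. This is a sketch, not verified against any particular compiler: the column b[n*j .. n*j+n-1] is contiguous and reused by every iteration of the k loop, which makes it a natural candidate, but whether the compiler actually places it in CUDA shared memory is implementation-defined (check the -Minfo output).

```c
typedef float ff;

/* Sketch only: "#pragma acc cache" asks the compiler to keep the named
   subarray in the fastest available memory (CUDA shared memory on NVIDIA
   targets).  The column of b read by the inner k loop is contiguous, so
   it is the easiest candidate; a's accesses are strided and harder to
   cache this way.  Matrices are column-major, as in the question. */
void mmul_cache(const ff* restrict a,
                const ff* restrict b,
                ff* restrict c,
                const int n) {
#pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
  {
#pragma acc region
    {
#pragma acc loop independent vector(16)
      for (int i = 0; i < n; ++i) {
#pragma acc loop independent vector(16)
        for (int j = 0; j < n; ++j) {
#pragma acc cache(b[n*j:n])   /* one column of b, reused n times below */
          ff sum = 0;
          for (int k = 0; k < n; ++k)
            sum += a[i + n * k] * b[k + n * j];
          c[i + n * j] = sum;
        }
      }
    }
  }
}
```

A plain C compiler ignores the pragmas and runs the function serially, so the numerics can be checked on the host before worrying about what the accelerator compiler does with the hints.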

Is there a way to achieve something similar with OpenACC? Is there a good tutorial/resource on the use of the cached directive or on how to map some of the power of shared memory from CUDA to OpenACC?

1
The PGI accelerator compiler may be using shared memory already. Have you inspected the output with the -Minfo switch? This tutorial may be of interest. - Robert Crovella
Yes, but the -Minfo switch only tells me how much shared memory my implementation is using. While this is useful, I was more interested in knowing whether there's a way of explicitly manipulating such memory. Being able to see the high-level CUDA code that gets generated is very helpful, though. - leo
@leo did you find an answer to your question? Were you able to explicitly define shared memory in OpenACC? - Millad

1 Answer

4
votes

If you are using the PGI Accelerator compiler, you can dump out the generated PTX file and see what is going on under the hood during execution:

pgcc -acc -fast -Minfo -ta=nvidia,cc13,keepptx matrixMult.c -o matrixMult

The generated PTX will be stored in the current directory.

EDIT: You may prefer to see the high-level generated code (CUDA C or CUDA Fortran) instead. For that, add keepgpu to the target options:

pgcc -acc -fast -Minfo -ta=nvidia,cc13,keepptx,keepgpu matrixMult.c -o matrixMult