3
votes

I'm trying to use shared memory to cache things with OpenACC.

Basically what I'm working on is a matrix multiplication, and what I have is this:

typedef float ff; 

// Multiplies two square column-major matrices a and b, puts the result in c.
// (Element (i, k) of a is stored at a[i + n * k].)
void mmul(const ff* restrict a, 
          const ff* restrict b, 
          ff* restrict c, 
          const int n) { 
#pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n]) 
{ 

#pragma acc region 
{ 

#pragma acc loop independent vector(16) 
  for (int i = 0; i < n; ++i) { 
#pragma acc loop independent vector(16) 
    for (int j = 0; j < n; ++j) { 
      ff sum = 0; 
      for (int k = 0; k < n; ++k) { 
        sum += a[i + n * k] * b[k + n * j]; 
      } 
      c[i + n * j] = sum; 
    } 
  } 

} 
}
}

What I would like to do is use shared memory to cache tiles of the matrices 'a' and 'b' to use in the computation of 'c', in a similar fashion to what the CUDA mmul algorithm does.

Basically, in CUDA I would know the exact size of my blocks, and would be able to:

  • declare a shared memory with the size of the block
  • copy the 'relevant' part of the data to the block
  • use this data
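Those three steps have a direct serial-C analogue, which may help when thinking about what the compiler has to generate. This is only a sketch: `TILE` stands in for the CUDA block size, the local arrays stand in for `__shared__` memory, `mmul_tiled` is a name I made up, and `n` is assumed to be a multiple of `TILE` to keep the code short.

```c
#define TILE 16

typedef float ff;

/* Serial analogue of the CUDA shared-memory scheme: for each TILE x TILE
   block of c, stage the matching tiles of a and b in small local arrays
   (playing the role of shared memory), then accumulate from those copies.
   Matrices are column-major, as in the question: element (i, k) of a is
   stored at a[i + n * k]. */
void mmul_tiled(const ff* restrict a, const ff* restrict b,
                ff* restrict c, const int n) {
  for (int bi = 0; bi < n; bi += TILE)
    for (int bj = 0; bj < n; bj += TILE) {
      ff acc[TILE][TILE] = {{0}};
      for (int bk = 0; bk < n; bk += TILE) {
        ff ta[TILE][TILE], tb[TILE][TILE];   /* step 1: the "shared" tiles */
        for (int i = 0; i < TILE; ++i)
          for (int k = 0; k < TILE; ++k) {   /* step 2: copy the relevant data */
            ta[i][k] = a[(bi + i) + n * (bk + k)];
            tb[i][k] = b[(bk + i) + n * (bj + k)];
          }
        for (int i = 0; i < TILE; ++i)       /* step 3: use this data */
          for (int j = 0; j < TILE; ++j)
            for (int k = 0; k < TILE; ++k)
              acc[i][j] += ta[i][k] * tb[k][j];
      }
      for (int i = 0; i < TILE; ++i)
        for (int j = 0; j < TILE; ++j)
          c[(bi + i) + n * (bj + j)] = acc[i][j];
    }
}
```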

I understand I can use the

#pragma acc cache

directive (the OpenACC spec spells it cache, not cached), and that I can specify block sizes with the vector and gang options, but I'm having some trouble understanding how that's going to be mapped to the CUDA architecture.
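For reference, here is one way the cache directive might be spelled out on this loop nest. This is a sketch, not verified against any particular compiler: the column b[n*j .. n*j+n-1] is contiguous and reused by every iteration of the k loop, which makes it a natural candidate, but whether the compiler actually places it in CUDA shared memory is implementation-defined (check the -Minfo output).

```c
typedef float ff;

/* Sketch only: "#pragma acc cache" asks the compiler to keep the named
   subarray in the fastest available memory (CUDA shared memory on NVIDIA
   targets).  The column of b read by the inner k loop is contiguous, so
   it is the easiest candidate; a's accesses are strided and harder to
   cache this way.  Matrices are column-major, as in the question. */
void mmul_cache(const ff* restrict a,
                const ff* restrict b,
                ff* restrict c,
                const int n) {
#pragma acc data copyin(a[0:n*n], b[0:n*n]) copy(c[0:n*n])
  {
#pragma acc region
    {
#pragma acc loop independent vector(16)
      for (int i = 0; i < n; ++i) {
#pragma acc loop independent vector(16)
        for (int j = 0; j < n; ++j) {
#pragma acc cache(b[n*j:n])   /* one column of b, reused n times below */
          ff sum = 0;
          for (int k = 0; k < n; ++k)
            sum += a[i + n * k] * b[k + n * j];
          c[i + n * j] = sum;
        }
      }
    }
  }
}
```

A plain C compiler ignores the pragmas and runs the function serially, so the numerics can be checked on the host before worrying about what the accelerator compiler does with the hints.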

Is there a way to achieve something similar with OpenACC? Is there a good tutorial/resource on the use of the cached directive or on how to map some of the power of shared memory from CUDA to OpenACC?

1
The PGI accelerator compiler may be using shared memory already. Have you inspected the output with the -Minfo switch? This tutorial may be of interest. - Robert Crovella
Yes, but the -Minfo switch only tells me how much shared memory my implementation is using. While this is useful, I was more interested in knowing whether there's a way of explicitly manipulating such memory. Being able to see the high-level CUDA code that gets generated is very helpful, though. - leo
@leo did you find an answer to your question? Were you able to explicitly define shared memory in OpenACC? - Millad

1 Answer

4
votes

If you are using the PGI Accelerator compiler, you can dump out the generated PTX file and see what is going on under the hood during execution:

pgcc -acc -fast -Minfo -ta=nvidia,cc13,keepptx matrixMult.c -o matrixMult

The generated PTX will be stored in the current directory.

EDIT: You may prefer to see the high-level generated code (CUDA C or CUDA Fortran) instead. For that, add keepgpu to the target options:

pgcc -acc -fast -Minfo -ta=nvidia,cc13,keepptx,keepgpu matrixMult.c -o matrixMult