Why do I get "Present table dump for device[1]: NVIDIA Tesla GPU 0" in this code?

Question

Here is a header file DataHolder.h:

#ifndef DATAHOLDER_H
#define DATAHOLDER_H
using FloatingType=float;
int LIFE=0;
const int GL=2000000;
template <typename Floating> class DataHolder {
public:
  Floating particles[GL];
public:
  DataHolder(){}
 ~DataHolder(){}
 void Propagate();
 void InitParticle();
 };
template <typename Floating> void DataHolder<Floating>::Propagate()
{
#pragma acc parallel loop copy(LIFE) present(particles)
  for(int i=0; i<LIFE; ++i) Floating r0= particles[i];
}
template <typename Floating> void DataHolder<Floating>::InitParticle()
{
#pragma acc parallel num_gangs(1) vector_length(1) present(particles[0:GL]) copy(LIFE)
{
  particles[LIFE]=0.0f;
#pragma acc atomic update
  ++LIFE;
}
}
#endif//DATAHOLDER_H

I use it in the file main.cpp:

#include <iostream>
#include "DataHolder.h"
#include <accelmath.h>
#include <openacc.h>
#include <cuda.h>
#include <cuda_runtime.h>
int main(int argc, char **argv)
{
  DataHolder<FloatingType> * d;
  cudaMalloc((void**) & d, sizeof(DataHolder<FloatingType>));
  std::cout<<"sizeof(DataHolder<FloatingType>)=" 
  <<sizeof(DataHolder<FloatingType>)/1024/1024<<" MB"<<std::endl;  
  LIFE=0;
  int step=0;
  d->InitParticle();
  cudaFree(d);
}

The program compiles, but fails with:

sizeof(DataHolder<FloatingType>)=7 MB
hostptr=0x501520000,stride=1,size=2000000,eltsize=4,flags=0x200=present,async=-1,threadid=1
Present table dump for device[1]: NVIDIA Tesla GPU 0, compute capability 3.0, threadid=1
host:0x604b60 device:0x501ce0000 size:4 presentcount:1+0 line:26 name:LIFE
allocated block device:0x501ce0000 size:512 thread:1
FATAL ERROR: data in PRESENT clause was not found on device 1: name=(null) host:0x501520000
 file:/home/70-gaa/NFbuild_script_CHECK_GPU/ERROR/T3DataHolder.h _ZN10DataHolderIfE12InitParticleEv line:26

Why? What is wrong?

I compile the code to run on a GeForce GTX 650 Ti GPU with the following command line:

cmake . -DCMAKE_C_COMPILER=pgcc -DCMAKE_CXX_COMPILER=pgc++ -DCMAKE_CXX_FLAGS="-acc -mcmodel=medium -ta=tesla:cc30,managed -fast -Mcuda=cuda10.1 --c++11"

I use the PGI 19.4 C++ compiler, gcc 5.3.1, Fedora 23 x86_64, CUDA 10.1, and CUDA driver version 418.67.

Stack Overflow is a question-and-answer site for specific questions about actual code. You need to learn to debug your code. Questions of the form "This code doesn't work" are not on-topic on Stack Overflow and are likely to be downvoted. See How to debug small programs for more information. — L. F.

Mat Colgrove · Accepted Answer · 2019-08-06T14:36:37

The "present" clause checks whether a particular variable is present on the device, but it can only check variables managed by the OpenACC runtime. Here, you're allocating data via cudaMalloc, which isn't managed by the runtime, so the lookup in the present table fails. In these cases, you should replace "present" with "deviceptr" to tell the runtime that it is a CUDA device pointer. You'll need to add the "this" pointer as well, since it's also a device pointer.
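As a rough illustration of the "deviceptr" idea, here is a minimal sketch that is not the code from the question. It assumes a free function (so no "this" pointer is involved) and uses acc_malloc as a stand-in for cudaMalloc; the names fill_half_steps and run_on_raw_buffer are hypothetical. The host fallback under #else exists only so the sketch also builds with a compiler that ignores OpenACC.

```cpp
#include <cstdlib>
#include <vector>
#ifdef _OPENACC
#include <openacc.h>
#endif

// Hypothetical helper: p is assumed to already be a device address, so
// deviceptr(p) asks the runtime to use it as-is instead of searching the
// present table for a matching host pointer.
void fill_half_steps(float* p, int n) {
#pragma acc parallel loop deviceptr(p)
  for (int i = 0; i < n; ++i) p[i] = 0.5f * i;
}

// Hypothetical driver: allocates a raw buffer, fills it, and returns a host copy.
std::vector<float> run_on_raw_buffer(int n) {
  std::vector<float> host(n);
  const std::size_t bytes = n * sizeof(float);
#ifdef _OPENACC
  // Raw device allocation that is not registered in the present table.
  float* p = static_cast<float*>(acc_malloc(bytes));
  fill_half_steps(p, n);
  // The buffer is never dereferenced on the host; it is copied back explicitly.
  acc_memcpy_from_device(host.data(), p, bytes);
  acc_free(p);
#else
  // Fallback without OpenACC: the pragma is ignored and everything runs on the host.
  float* p = static_cast<float*>(std::malloc(bytes));
  fill_half_steps(p, n);
  for (int i = 0; i < n; ++i) host[i] = p[i];
  std::free(p);
#endif
  return host;
}
```

Note that the host side only passes the raw pointer around and never dereferences it, which is the property the member-function version in the question does not have.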

However, your code will then segfault on the host, since you dereference "d", which is not a valid host pointer.

The simplest solution here is to not use cudaMalloc and instead allocate "d" using "new". Since you're using CUDA Unified Memory, the data movement of "d" will be handled by the CUDA driver.
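A minimal sketch of that approach, under stated assumptions: DataHolder is reduced to the one member function that failed, run_once is a hypothetical driver, and the "present" clause is simply left out on the assumption that, with the managed option, heap data is migrated by the CUDA driver. Whether "present" could stay in place with managed memory is not something this sketch verifies.

```cpp
// Reduced stand-in for the question's DataHolder.h (assumption: one translation unit).
constexpr int GL = 2000000;
int LIFE = 0;

template <typename Floating>
struct DataHolder {
  Floating particles[GL];
  void InitParticle();
};

template <typename Floating>
void DataHolder<Floating>::InitParticle() {
  // No present(...) clause here: the object is assumed to live in CUDA Unified
  // Memory, so only the global counter LIFE is moved explicitly.
#pragma acc parallel num_gangs(1) vector_length(1) copy(LIFE)
  {
    particles[LIFE] = 0.0f;
#pragma acc atomic update
    ++LIFE;
  }
}

// Hypothetical driver: allocates with new instead of cudaMalloc, so d is a
// valid host pointer and d->InitParticle() can be called from host code.
int run_once() {
  LIFE = 0;
  DataHolder<float>* d = new DataHolder<float>;
  d->InitParticle();
  const int alive = LIFE;
  delete d;
  return alive;
}
```

The object is about 8 MB, so it is allocated on the heap rather than the stack; "new" also replaces the cudaFree call in main with a plain "delete".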
