After reading this question: "How to differentiate between pointers to shared and global memory?", I decided to try isspacep.local
, isspacep.global
and isspacep.shared
in a simple test program.
The tests for local and shared memory work all the time, however the global memory test does not always work, for instance it did when the device code was compiled in debug mode (-G
).
At first I thought that the compiler detected that I used a dummy vector for the global memory and handled it differently, so I used -Xcicc -O0 -Xptxas -O0
(cf. "Completely disable optimizations on NVCC"). If I compute with sm_30
, global memory is detected correctly. However, if I compute with sm_20
or sm_21
, global memory is not detected as such. Note that with -G
, any sm >= 20
works.
Is there something that I am missing here? Is there an additional flag given to the compiler when using -G
that could explain these differences?
Compilation
nvcc test_pointer.cu -arch=sm_20 -Xcicc -O0 -Xptxas -O0 -Xptxas -v -o test_pointer
Code
#include <stdio.h>
#include <cuda.h>
#define CUDA_CHECK_ERROR() __cuda_check_errors(__FILE__, __LINE__)
#define CUDA_SAFE_CALL(err) __cuda_safe_call(err, __FILE__, __LINE__)
inline void __cuda_check_errors(const char *filename, const int line_number)
{
cudaError err = cudaDeviceSynchronize();
if(err != cudaSuccess)
{
printf("CUDA error %i at %s:%i: %s\n",
err, filename, line_number, cudaGetErrorString(err));
exit(-1);
}
}
inline void __cuda_safe_call(cudaError err, const char *filename, const int line_number)
{
if (err != cudaSuccess)
{
printf("CUDA error %i at %s:%i: %s\n",
err, filename, line_number, cudaGetErrorString(err));
exit(-1);
}
}
__device__ unsigned int __isLocal(const void *ptr)
{
unsigned int ret;
asm volatile ("{ \n\t"
" .reg .pred p; \n\t"
" isspacep.local p, %1; \n\t"
" selp.u32 %0, 1, 0, p; \n\t"
#if (defined(_MSC_VER) && defined(_WIN64)) || defined(__LP64__)
"} \n\t" : "=r"(ret) : "l"(ptr));
#else
"} \n\t" : "=r"(ret) : "r"(ptr));
#endif
return ret;
}
__device__ unsigned int __isShared(const void *ptr)
{
unsigned int ret;
asm volatile ("{ \n\t"
" .reg .pred p; \n\t"
" isspacep.shared p, %1; \n\t"
" selp.u32 %0, 1, 0, p; \n\t"
#if (defined(_MSC_VER) && defined(_WIN64)) || defined(__LP64__)
"} \n\t" : "=r"(ret) : "l"(ptr));
#else
"} \n\t" : "=r"(ret) : "r"(ptr));
#endif
return ret;
}
__device__ void analyze_pointer(const void *ptr)
{
printf("\t* is local: %u\n", __isLocal(ptr));
printf("\t* is global: %u\n", __isGlobal(ptr));
printf("\t* is shared: %u\n", __isShared(ptr));
}
template <typename T, unsigned int N>
__global__ void test_kernel(T *vec)
{
// Shared array
__shared__ T shared_vec[10];
// Register array
T reg[10];
if (blockIdx.x == 0 && threadIdx.x == 0)
{
printf("Register array:\n");
analyze_pointer(®);
printf("\nGlobal array:\n");
analyze_pointer(vec);
printf("\nShared array:\n");
analyze_pointer(&shared_vec);
}
}
int main()
{
typedef float type_t;
const unsigned int N = 128;
type_t* d_vec;
CUDA_SAFE_CALL(cudaMalloc(&d_vec, N * sizeof(type_t)));
test_kernel<type_t, N><<<1, N>>>(d_vec);
CUDA_CHECK_ERROR();
CUDA_SAFE_CALL(cudaFree(d_vec));
}
Output
Register array:
* is local: 1
* is global: 0
* is shared: 0
Global array:
* is local: 0
* is global: 0 (or 1 with -G or sm_30)
* is shared: 0
Shared array:
* is local: 0
* is global: 0
* is shared: 1
Hardware/software properties
This was tested with CUDA 5.0, GeForce GT 650M (CC 3.0), drivers 319.17 on Arch Linux 64-bit.
UPDATE #1
I just tested this code with a Tesla C2070 (CC 2.0) with the 304.88 drivers, CUDA 5.0 on Linux 64-bit, and it works. Global memory is detected when optimization is turned off, i.e. -arch=sm_20 -Xcicc -O0
, or when an extra printf("\t* ptr = %ld\n", ptr);
is added (cf. @RobertCrovella's comment). It does sound like a driver issue.
UPDATE #2
I made some more tests and here is what I get with my CC 3.0 device depending on how I compile the program:
-arch=sm_30 ---> undetected (probably optimized)
-arch=sm_30 -Xcicc -O0 -Xptxas -O0 ---> OK
-arch=sm_30 -G ---> OK
-arch=compute_30 -code=sm_30 -Xcicc -O0 -Xptxas -O0 ---> OK
-arch=compute_30 -code=sm_30 -G ---> OK
-arch=compute_30 -code=compute_30 -Xcicc -O0 -Xptxas -O0 ---> undetected
-arch=compute_30 -code=compute_30 -G ---> OK
-arch=sm_20 ---> undetected
-arch=sm_20 -Xcicc -O0 -Xptxas -O0 ---> undetected
-arch=sm_20 -G ---> OK
-arch=compute_20 -Xcicc -O0 -Xptxas -O0 ---> undetected
-arch=compute_20 -G ---> OK
-arch=compute_20 -code=sm_20 -Xcicc -O0 -Xptxas -O0 ---> runtime error (as expected)
-arch=compute_20 -code=sm_20 -G ---> runtime error (as expected)
-arch=compute_20 -code=compute_20 -Xcicc -O0 -Xptxas -O0 ---> undetected
-arch=compute_20 -code=compute_20 -G ---> OK
-arch=compute_20 -code=sm_30 ---> undetected (probably optimized)
-arch=compute_20 -code=sm_30 -Xcicc -O0 -Xptxas -O0 ---> OK
-arch=compute_20 -code=sm_30 -G ---> OK
analyze_pointer
function:printf("\t* ptr = %ld\n", ptr);
, then I get different behavior, but I've been comparing ptx output of the various cases and I haven't been able to explain to myself what is going on yet. – Robert Crovellaprintf()
does not seem to change the output on my computer. Do you also use the 319.17 drivers? – BenC-arch=sm_20 -Xcicc -O0
). – BenC