Strange behavior when detecting global memory

Question

After reading this question: "How to differentiate between pointers to shared and global memory?", I decided to try isspacep.local, isspacep.global and isspacep.shared in a simple test program.

The tests for local and shared memory work all the time, however the global memory test does not always work, for instance it did when the device code was compiled in debug mode (-G).

At first I thought that the compiler detected that I used a dummy vector for the global memory and handled it differently, so I used -Xcicc -O0 -Xptxas -O0 (cf. "Completely disable optimizations on NVCC"). If I compute with sm_30, global memory is detected correctly. However, if I compute with sm_20 or sm_21, global memory is not detected as such. Note that with -G, any sm >= 20 works.

Is there something that I am missing here? Is there an additional flag given to the compiler when using -G that could explain these differences?

Compilation

nvcc test_pointer.cu -arch=sm_20 -Xcicc -O0 -Xptxas -O0 -Xptxas -v -o test_pointer

Code

#include <stdio.h>
#include <cuda.h>

#define CUDA_CHECK_ERROR()  __cuda_check_errors(__FILE__, __LINE__)
#define CUDA_SAFE_CALL(err) __cuda_safe_call(err, __FILE__, __LINE__)

inline void __cuda_check_errors(const char *filename, const int line_number)
{
    cudaError err = cudaDeviceSynchronize();
    if(err != cudaSuccess)
    {
        printf("CUDA error %i at %s:%i: %s\n",
               err, filename, line_number, cudaGetErrorString(err));
        exit(-1);
    }
}

inline void __cuda_safe_call(cudaError err, const char *filename, const int line_number)
{
    if (err != cudaSuccess)
    {
        printf("CUDA error %i at %s:%i: %s\n",
               err, filename, line_number, cudaGetErrorString(err));
        exit(-1);
    }
}

__device__ unsigned int __isLocal(const void *ptr)
{
  unsigned int ret;
  asm volatile ("{ \n\t"
                "    .reg .pred p; \n\t"
                "    isspacep.local p, %1; \n\t"
                "    selp.u32 %0, 1, 0, p;  \n\t"
#if (defined(_MSC_VER) && defined(_WIN64)) || defined(__LP64__)
                "} \n\t" : "=r"(ret) : "l"(ptr));
#else
                "} \n\t" : "=r"(ret) : "r"(ptr));
#endif

  return ret;
}

__device__ unsigned int __isShared(const void *ptr)
{
  unsigned int ret;
  asm volatile ("{ \n\t"
                "    .reg .pred p; \n\t"
                "    isspacep.shared p, %1; \n\t"
                "    selp.u32 %0, 1, 0, p;  \n\t"
#if (defined(_MSC_VER) && defined(_WIN64)) || defined(__LP64__)
                "} \n\t" : "=r"(ret) : "l"(ptr));
#else
                "} \n\t" : "=r"(ret) : "r"(ptr));
#endif

  return ret;
}

__device__ void analyze_pointer(const void *ptr)
{
    printf("\t* is local:  %u\n", __isLocal(ptr));
    printf("\t* is global: %u\n", __isGlobal(ptr));
    printf("\t* is shared: %u\n", __isShared(ptr));
}

template <typename T, unsigned int N>
__global__ void test_kernel(T *vec)
{
    // Shared array
    __shared__ T shared_vec[10];

    // Register array
    T reg[10];

    if (blockIdx.x == 0 && threadIdx.x == 0)
    {
        printf("Register array:\n");
        analyze_pointer(&reg);

        printf("\nGlobal array:\n");
        analyze_pointer(vec);

        printf("\nShared array:\n");
        analyze_pointer(&shared_vec);
    }
}

int main()
{
    typedef float type_t;
    const unsigned int N = 128;

    type_t* d_vec;

    CUDA_SAFE_CALL(cudaMalloc(&d_vec, N * sizeof(type_t)));

    test_kernel<type_t, N><<<1, N>>>(d_vec);
    CUDA_CHECK_ERROR();

    CUDA_SAFE_CALL(cudaFree(d_vec));
}

Output

Register array:
    * is local:  1
    * is global: 0
    * is shared: 0

Global array:
    * is local:  0
    * is global: 0 (or 1 with -G or sm_30)
    * is shared: 0

Shared array:
    * is local:  0
    * is global: 0
    * is shared: 1

Hardware/software properties

This was tested with CUDA 5.0, GeForce GT 650M (CC 3.0), drivers 319.17 on Arch Linux 64-bit.

UPDATE #1

I just tested this code with a Tesla C2070 (CC 2.0) with the 304.88 drivers, CUDA 5.0 on Linux 64-bit, and it works. Global memory is detected when optimization is turned off, i.e. -arch=sm_20 -Xcicc -O0, or when an extra printf("\t* ptr = %ld\n", ptr); is added (cf. @RobertCrovella's comment). It does sound like a driver issue.

UPDATE #2

I made some more tests and here is what I get with my CC 3.0 device depending on how I compile the program:

-arch=sm_30                                               ---> undetected (probably optimized)
-arch=sm_30 -Xcicc -O0 -Xptxas -O0                        ---> OK
-arch=sm_30 -G                                            ---> OK
-arch=compute_30 -code=sm_30 -Xcicc -O0 -Xptxas -O0       ---> OK
-arch=compute_30 -code=sm_30 -G                           ---> OK
-arch=compute_30 -code=compute_30 -Xcicc -O0 -Xptxas -O0  ---> undetected
-arch=compute_30 -code=compute_30 -G                      ---> OK
-arch=sm_20                                               ---> undetected
-arch=sm_20 -Xcicc -O0 -Xptxas -O0                        ---> undetected
-arch=sm_20 -G                                            ---> OK
-arch=compute_20 -Xcicc -O0 -Xptxas -O0                   ---> undetected
-arch=compute_20 -G                                       ---> OK
-arch=compute_20 -code=sm_20 -Xcicc -O0 -Xptxas -O0       ---> runtime error (as expected)
-arch=compute_20 -code=sm_20 -G                           ---> runtime error (as expected)
-arch=compute_20 -code=compute_20 -Xcicc -O0 -Xptxas -O0  ---> undetected
-arch=compute_20 -code=compute_20 -G                      ---> OK
-arch=compute_20 -code=sm_30                              ---> undetected (probably optimized)
-arch=compute_20 -code=sm_30 -Xcicc -O0 -Xptxas -O0       ---> OK
-arch=compute_20 -code=sm_30 -G                           ---> OK

As a data point, I find that if I add this line of code at the beginning of your analyze_pointer function: printf("\t* ptr = %ld\n", ptr);, then I get different behavior, but I've been comparing ptx output of the various cases and I haven't been able to explain to myself what is going on yet. — Robert Crovella
@RobertCrovella: The additional printf() does not seem to change the output on my computer. Do you also use the 319.17 drivers? — BenC
No I'm using CUDA 5.0 on linux 64 bit with the 304.54 driver that ships with that package. I'm using a cc 2.0 device as well, and compiling with -arch=sm_20 — Robert Crovella
@RobertCrovella: I just tested with a Tesla C2070 (CC 2.0) with the 304.88 drivers on Linux 64bits, and it works (global memory detected with optimization turned off, i.e. -arch=sm_20 -Xcicc -O0). — BenC
I can reproduce this problem too with a 64b Linux and a K20c. — Jared Hoberock

BenC BenC · Accepted Answer · 2013-11-12T14:46:50

1

votes

This was apparently a bug in CUDA and the fix should be released with CUDA 6.0.