How do I direct all accesses to global memory in CUDA?

Question

I want all accesses from my program to go to global memory (even if the data is already in the L1/L2 cache). To that end, I found that the L1 cache can be bypassed by passing these options to the nvcc compiler:

-Xptxas -dlcm=cg

For the related .cv cache operator, the CUDA documentation states this:

.cv Cache as volatile (consider cached system memory lines stale, fetch again).

So I am assuming that when I compile with either -dlcm=cg or -dlcm=cv, the generated PTX file should differ from the one generated normally. (The load instructions should carry either a .cg or a .cv suffix.)

My sample program:

__global__ void rh_kernel(int *datainRowX, int *datainRowY) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid != 0)
        return;
    int x, y;
    x = datainRowX[1];
    y = datainRowY[2];
    datainRowX[0] = x + y;
}

int main(int argc, char** argv) {
    int* d_datainRowX;
    cudaMalloc((void**)&d_datainRowX, sizeof(int) * 268435456);

    int* d_datainRowY;
    cudaMalloc((void**)&d_datainRowY, sizeof(int) * 268435456);

    rh_kernel<<<1024, 1>>>(d_datainRowX, d_datainRowY);
    cudaFree(d_datainRowX); cudaFree(d_datainRowY);
    return(0);
}

I notice that whatever options I pass to the nvcc compiler ("-Xptxas -dlcm=cg", "-Xptxas -dlcm=cv", or nothing), the generated PTX is the same in all three cases. I am using the -ptx option to generate the PTX file. What am I missing? Is there any other way to achieve what I want? Thanks in advance for your time.

pSoLT · Accepted Answer · 2017-01-16T09:12:09

According to the CUDA Toolkit documentation:

L1 caching in Kepler GPUs is reserved only for local memory accesses, such as register spills and stack data. Global loads are cached in L2 only (or in the Read-Only Data Cache).

GK110B-based products such as the Tesla K40 GPU Accelerator, GK20A, and GK210 retain this behavior by default

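The L1 cache is not used for global memory reads on Kepler by default, so adding -Xptxas -dlcm=cg makes no difference there. Note also that -Xptxas options are passed to ptxas, which runs after the PTX has been generated, so they can only affect the final machine code (SASS), not the file produced by -ptx.

Disabling the L2 cache is not possible.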
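
If the goal is to see a cache operator on specific loads in the PTX itself, one option is to request it per load with inline PTX instead of relying on a compiler-wide flag. The following is only a minimal sketch under some assumptions: load_cg, load_cv, and run_rh_example are hypothetical helper names that do not appear in the question, the pointers are assumed to refer to global memory allocated with cudaMalloc, and a 64-bit build is assumed so that the "l" constraint matches the pointer size. The .cv operator asks the hardware to fetch again rather than trust a cached line, but whether a given load is actually served from DRAM still depends on the architecture.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: load one int with the .cg operator
// (cache at the L2 level, bypass L1).
// Assumes p points to global memory (e.g. from cudaMalloc).
__device__ __forceinline__ int load_cg(const int *p) {
    int v;
    asm volatile("ld.global.cg.s32 %0, [%1];" : "=r"(v) : "l"(p) : "memory");
    return v;
}

// Hypothetical helper: load one int with the .cv operator
// (treat cached lines as stale and fetch again).
__device__ __forceinline__ int load_cv(const int *p) {
    int v;
    asm volatile("ld.global.cv.s32 %0, [%1];" : "=r"(v) : "l"(p) : "memory");
    return v;
}

// Same logic as the kernel in the question, with explicit cache operators.
__global__ void rh_kernel(int *datainRowX, const int *datainRowY) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid != 0)
        return;
    int x = load_cg(&datainRowX[1]);
    int y = load_cv(&datainRowY[2]);
    datainRowX[0] = x + y;
}

// Hypothetical host driver: fills two small arrays, runs the kernel,
// and returns datainRowX[0], or -1 if an allocation fails.
int run_rh_example() {
    int hx[4] = {0, 10, 0, 0};
    int hy[4] = {0, 0, 32, 0};
    int *dx = nullptr, *dy = nullptr;
    if (cudaMalloc(&dx, sizeof(hx)) != cudaSuccess)
        return -1;
    if (cudaMalloc(&dy, sizeof(hy)) != cudaSuccess) {
        cudaFree(dx);
        return -1;
    }
    cudaMemcpy(dx, hx, sizeof(hx), cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, sizeof(hy), cudaMemcpyHostToDevice);
    rh_kernel<<<1, 32>>>(dx, dy);
    int result = -1;
    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(&result, dx, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dx);
    cudaFree(dy);
    return result;
}
```

Compiling this file with -ptx should show ld.global.cg and ld.global.cv in the output regardless of any -dlcm setting, because the operators are written into the instruction text rather than chosen by ptxas.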
