Global load transaction count when in coalesced memory access

Question

I've created a simple kernel to test coalesced memory access by observing the transaction counts on an NVIDIA GTX 980 card. The kernel is:

__global__
void copy_coalesced(float * d_in, float * d_out)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;

    d_out[tid] = d_in[tid];
}

I run it with the following launch configuration:

#define BLOCKSIZE 32   

int data_size  = 10240;                  // always a multiple of BLOCKSIZE
int gridSize   = data_size / BLOCKSIZE;

copy_coalesced<<<gridSize, BLOCKSIZE>>>(d_in, d_out);

Since the data access in the kernel is fully coalesced and the data type is float (4 bytes), the expected number of load/store transactions can be calculated as follows:

Load Transaction Size = 32 bytes

Number of floats that can be loaded per transaction = 32 bytes / 4 bytes = 8

Number of transactions needed to load 10240 floats = 10240 / 8 = 1280 transactions

The same number of transactions is expected for writing the data.

But the nvprof metrics show the following results:

gld_transactions    2560
gst_transactions    1280

gld_transactions_per_request    8.0
gst_transactions_per_request    4.0

I cannot figure out why loading the data takes twice as many transactions as expected. Yet the load and store efficiency metrics both report 100%.

What am I missing here?

cudaMalloc(&d_in, sizeof(float)*data_size); and cudaMalloc(&d_out, sizeof(float)*data_size); — BAdhi

Florent DUGUET · Accepted Answer · 2016-06-16T08:56:07

I reproduced your results on Linux:

1  gld_transactions             Global Load Transactions               2560
1  gst_transactions             Global Store Transactions              1280
1  l2_tex_read_transactions     L2 Transactions (Texture Reads)        1280
1  l2_tex_write_transactions    L2 Transactions (Texture Writes)       1280

However, on Windows, using Nsight Visual Studio Edition, I get values that appear to be better:

You may want to contact NVIDIA as it could simply be a display issue in nvprof.
