1
votes

Surface memory is the writable counterpart of read-only texture memory in CUDA.

I've found NVIDIA GPU peak-bandwidth numbers in the academic literature for reads from global memory and shared memory. However, I've found much less information on write throughput for the various CUDA memory spaces.

In particular, I'm interested in the bandwidth (and latency too, if known) of the CUDA surface memory on Fermi and Kepler GPUs.

  • Are there benchmarking numbers on this?
  • If not, how might I implement a benchmark to measure the bandwidth of writing to surface memory?

2 Answers

2
votes

According to Device Memory Accesses,

  • On a cache miss: a texture fetch or surface read costs one global memory read from device memory;
  • On a cache hit: it reduces global mem bandwidth demand but not fetch latency.

Since the latencies of texture, surface, and global memory are almost the same, and all of them reside in off-chip DRAM, I think the peak bandwidth of surface memory is the same as the global memory bandwidth quoted in the GPU specs.

To measure latency, the paper you referenced probably uses only one thread, so the latency is easy to calculate:

global mem read latency = total read time / number of reads

You could time surface writes in a similar fashion. However, I don't think it is reasonable to apply this method to shared memory latency measurement as done in that paper, since the overhead of the for loop cannot be ignored relative to the shared memory latency.
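For the bandwidth side, a minimal sketch of a surface-write benchmark might look like the following. It times repeated launches of a kernel that writes one float per thread through a surface object, using CUDA events; the kernel name, image dimensions, iteration count, and launch configuration are illustrative assumptions, not something from the question.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread writes one float through the surface object.
// Note: surf2Dwrite takes the x coordinate in BYTES.
__global__ void surfWriteKernel(cudaSurfaceObject_t surf, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        surf2Dwrite(1.0f, surf, x * (int)sizeof(float), y);
}

int main()
{
    const int width = 4096, height = 4096, iters = 100;  // illustrative sizes

    // Surface access requires a CUDA array allocated with the
    // cudaArraySurfaceLoadStore flag.
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray_t arr;
    cudaMallocArray(&arr, &desc, width, height, cudaArraySurfaceLoadStore);

    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeArray;
    resDesc.res.array.array = arr;
    cudaSurfaceObject_t surf;
    cudaCreateSurfaceObject(&surf, &resDesc);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    surfWriteKernel<<<grid, block>>>(surf, width, height);  // warm-up launch
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        surfWriteKernel<<<grid, block>>>(surf, width, height);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double bytes = (double)width * height * sizeof(float) * iters;
    printf("surface write bandwidth: %.1f GB/s\n", bytes / (ms * 1e6));

    cudaDestroySurfaceObject(surf);
    cudaFreeArray(arr);
    return 0;
}
```

Swapping the kernel body for a plain global-memory store to a `cudaMalloc`'d buffer gives a baseline to compare against.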

2
votes

On compute capability 2.x and 3.x devices, surface writes go through the L1 cache and have the same throughput and latency as global memory writes.