I'm designing a set of mathematical functions and implementing them in both CPU and GPU (CUDA) versions.
Some of these functions are based on lookup tables. Most of the tables take 4KB, some a bit more. A table-based function takes an input, picks one or two entries of the lookup table, and then computes the result by interpolating between them (or applying a similar technique), as sketched below.
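For concreteness, here is a minimal sketch of the access pattern I have in mind; the table size, the [0, 1) input range, and the name `lerp_lookup` are just illustrative assumptions:

    #define TABLE_SIZE 1024  // 1024 floats = 4KB per table

    // Illustrative device function: map x in [0, 1) onto the table and
    // linearly interpolate between the two surrounding entries.
    __device__ float lerp_lookup(const float *table, float x)
    {
        float pos  = x * (TABLE_SIZE - 1);
        int   idx  = (int)pos;          // lower of the two entries
        float frac = pos - (float)idx;  // interpolation weight
        return table[idx] + frac * (table[idx + 1] - table[idx]);
    }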
My question is: where should I store these lookup tables? A CUDA device offers several memory spaces for this (global memory, constant memory, texture memory, ...). Given that every table can be read concurrently by many threads, and that the input values, and therefore the lookup indices, can be completely uncorrelated among the threads of each warp (resulting in uncorrelated memory accesses), which memory provides the fastest access?
I should add that the contents of these tables are precomputed and completely constant.
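The obvious baseline I can think of is plain global memory, along these lines (just a sketch; `h_table`/`d_table` are hypothetical names and error checking is omitted):

    #include <cuda_runtime.h>

    #define TABLE_SIZE 1024  // 1024 floats = 4KB per table

    int main(void)
    {
        float h_table[TABLE_SIZE] = {0};  // would hold the precomputed values
        float *d_table = NULL;

        cudaMalloc(&d_table, sizeof(h_table));
        cudaMemcpy(d_table, h_table, sizeof(h_table), cudaMemcpyHostToDevice);

        // ... launch kernels that read the table through lerp_lookup ...

        cudaFree(d_table);
        return 0;
    }

The alternatives would presumably be declaring each table `__constant__` and uploading it with `cudaMemcpyToSymbol`, or going through the texture path.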
EDIT
Just to clarify: I need to store about 10 different 4KB lookup tables (roughly 40KB in total). That said, it would be great to know whether the answer for this case would also hold for, say, 100 4KB tables, or for 10 16KB lookup tables.