I'm designing a set of mathematical functions and implementing them in both CPU and GPU (CUDA) versions.
Some of these functions are based on lookup tables. Most of the tables take 4KB, some a bit more. A table-based function takes an input, picks one or two entries of the lookup table, and then computes the result by interpolating between them (or applying a similar technique), as sketched below.
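For concreteness, here is a minimal sketch of the access pattern I have in mind; the table size, the [0, 1) input range, and the name `lerp_lookup` are just illustrative assumptions:

    #define TABLE_SIZE 1024  // 1024 floats = 4KB per table

    // Illustrative device function: map x in [0, 1) onto the table and
    // linearly interpolate between the two surrounding entries.
    __device__ float lerp_lookup(const float *table, float x)
    {
        float pos  = x * (TABLE_SIZE - 1);
        int   idx  = (int)pos;          // lower of the two entries
        float frac = pos - (float)idx;  // interpolation weight
        return table[idx] + frac * (table[idx + 1] - table[idx]);
    }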
My question is: where should I store these lookup tables? A CUDA device offers several memory spaces for this (global memory, constant memory, texture memory, ...). Given that every table can be read concurrently by many threads, and that the input values, and therefore the lookup indices, can be completely uncorrelated among the threads of each warp (resulting in uncorrelated memory accesses), which memory provides the fastest access?
I should add that the contents of these tables are precomputed and completely constant.
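The obvious baseline I can think of is plain global memory, along these lines (just a sketch; `h_table`/`d_table` are hypothetical names and error checking is omitted):

    #include <cuda_runtime.h>

    #define TABLE_SIZE 1024  // 1024 floats = 4KB per table

    int main(void)
    {
        float h_table[TABLE_SIZE] = {0};  // would hold the precomputed values
        float *d_table = NULL;

        cudaMalloc(&d_table, sizeof(h_table));
        cudaMemcpy(d_table, h_table, sizeof(h_table), cudaMemcpyHostToDevice);

        // ... launch kernels that read the table through lerp_lookup ...

        cudaFree(d_table);
        return 0;
    }

The alternatives would presumably be declaring each table `__constant__` and uploading it with `cudaMemcpyToSymbol`, or going through the texture path.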
EDIT
Just to clarify: I need to store about 10 different 4KB lookup tables (roughly 40KB in total). That said, it would be great to know whether the answer for this case would also hold for, say, 100 4KB tables, or for 10 16KB lookup tables.