I don't know whether this was possible earlier, but CUDA 8.0 lets you fine-tune the caching behaviour of specific reads and writes through inline PTX cache operators. See the PTX ISA manual (the cache operators on ld and st) for details.
For example, to make this read always go to main memory:
const float4 val = input[i];
you could write the following:
float4 val;
const float4* myinput = input + i;
asm("ld.global.cv.v4.f32 {%0, %1, %2, %3}, [%4];"
    : "=f"(val.x), "=f"(val.y), "=f"(val.z), "=f"(val.w)
    : "l"(myinput));
I managed to speed up one of my cache-intensive kernels by about 20% by using non-cached reads and writes for data that, by design, was accessed only once.
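On the write side, st accepts analogous cache operators (.cs for streaming/evict-first data, .wt for write-through). Here is a rough sketch of a non-cached vector store; the helper name store_streaming and the choice of .cs are only assumptions for illustration, not necessarily what I used:

__device__ __forceinline__ void store_streaming(float4* ptr, float4 val)
{
    // st.global.cs: "cache streaming" - mark the line as evict-first so data
    // written only once does not pollute the cache. asm volatile keeps the
    // compiler from removing or reordering a store with no output operands.
    asm volatile("st.global.cs.v4.f32 [%0], {%1, %2, %3, %4};"
                 :: "l"(ptr), "f"(val.x), "f"(val.y), "f"(val.z), "f"(val.w));
}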