INTRODUCTION
In this question we can learn how to disable L1 cache for one single variable. Here is the accepted answer:
As mentioned above you can use inline PTX, here is an example:
__device__ __inline__ double ld_gbl_cg(const double *addr) {
double return_value;
asm("ld.global.cg.f64 %0, [%1];" : "=d"(return_value) : "l"(addr));
return return_value;
}
You can easily vary this by swapping .f64 for .f32 (float) or .s32 (int) etc., the constraint of return_value "=d" for "=d" (float) or "=r" (int) etc. Note that the last constraint before (addr) - "l" - denotes 64 bit addressing, if you are using 32 bit addressing, it should be "r".
Now, I however, I want to load a boolean (1 byte) not a floating point. So, I thought that I could do something like this (for architecture >=sm_20):
__device__ inline bool ld_gbl_cg(const bool* addr){
bool return_value;
asm("ld.global.cg.u8 %0, [%1];" : "=???"(return_value) : "l"(addr));
return return_value;
}
, where "???" should be the appropriate constraint letter for a boolean, respectively for an 8 bit unsinged integer (From this question, i deduced this, since it is noted that for >=sm_20, "u8" is used for a boolean). Howevever, I cannot find an appropriate constraint letter in nvidias document "Using inline PTX Assembly in CUDA" (On page 6 are listed some constraint letters). So my question is:
QUESTION
Is there any CUDA inline PTX constraint letter for any of the types:
- boolean
- unsigned 8 bit integer
- or evtl 8 bit binary variable
If not, what can I do in my case (explained in the introduction)? - Can the parameters "b0", "b1", etc shortly discussed here, be of help?
Thank you very much in advance for any help or comments!
UPDATE
I also need a store function reading from L2 cache instead of global memory - i.e. the store function that is complementary to the above ld_gbl_cg function (only once I have this function, I can completely verify that njuffa's answer works). My best guess based on njuffa's answer below would be:
__device__ __forceinline__ void st_gbl_cg (const bool *addr, bool t)
{
#if defined(__LP64__) || defined(_WIN64)
asm ("st.global.cg.u8 [%0], %1;" : "=l"(addr) : "h"((short)t));
#else
asm ("st.global.cg.u8 [%0], %1;" : "=r"(addr) : "h"((short)t));
#endif
}
However, the compiler gives the warning "parameter "addr" was set but never used" and the programm fails at runtime with an "unspecified launch failure". I also tried with .u16 instead of .u8, as I do not know to what exactly it refers. Yet the result is the same.
(Additional information) The following paragraph in the PTX 3.1 documentation, seems to be important for this question:
5.2.2 Restricted Use of Sub-Word Sizes The .u8, .s8, and .b8 instruction types are restricted to ld, st, and cvt instructions. The .f16 floating-point type is allowed only in conversions to and from .f32 and .f64 types. All floating-point instructions operate only on .f32 and .f64 types. For convenience, ld, st, and cvt instructions permit source and destination data operands to be wider than the instruction-type size, so that narrow values may be loaded, stored, and converted using regular-width registers. For example, 8-bit or 16-bit values may be held directly in 32-bit or 64-bit registers when being loaded, stored, or converted to other types and sizes.
.u8
is unsigned 8 bit integer. I don't think there is a boolean built-in type. – Robert Crovella