Well, I have quite a delicate question :)
Let's start with what I have:
- Data: a large array, already copied to the GPU
- A program, generated on the CPU (host), which needs to be evaluated for every element of that array
- The program changes very frequently; it can be generated as a CUDA C string, a PTX string, or something else (?) and needs to be re-evaluated after each change
What I want: Basically, I just want to make this as fast as possible, e.g. avoid compiling CUDA C to PTX at runtime. The solution can even be completely device-specific; no great compatibility is required here :)
What I know: I already know the function cuModuleLoad, which can load a module and create a kernel from PTX code stored in a file. But I think there must be some other way to create a kernel directly, without saving it to a file first. Or perhaps it may be possible to store it as bytecode?
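For reference, here is a minimal sketch of loading a module straight from a PTX string in memory with the driver API's cuModuleLoadData, so no file is involved. The PTX text and the kernel name "kern" are placeholders for whatever your host-side generator emits:

```cuda
// Hedged sketch: create a kernel from a PTX string held in memory,
// using the CUDA driver API. No intermediate file is needed.
#include <cuda.h>
#include <stdio.h>

static const char *ptxSource = "/* ...generated PTX text... */";  // placeholder

int main(void) {
    CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // cuModuleLoadData takes the image directly from memory.
    // The PTX is JIT-compiled for the current device at this point.
    CUresult r = cuModuleLoadData(&mod, ptxSource);
    if (r != CUDA_SUCCESS) { fprintf(stderr, "load failed: %d\n", r); return 1; }

    // "kern" is an assumed kernel name -- use your own entry point.
    cuModuleGetFunction(&fn, mod, "kern");

    // ...set up arguments and launch with cuLaunchKernel...

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```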
My question: How would you do that? Could you post an example, or a link to a website covering a similar topic? TY
Edit: OK, so a PTX kernel can be run directly from a PTX string (char array). Anyway, I still wonder: is there some better / faster solution? There is still a JIT compilation step from the PTX string to device code, which I would like to avoid if possible. I also suspect that some clever way of creating a device-specific CUDA binary from PTX might exist, which would remove the JIT compiler lag (it's small, but it can add up if you have huge numbers of kernels to run) :)
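One way to remove the JIT lag entirely, since you said a device-specific solution is fine (hedged sketch — file names, the sm_20 architecture, and the kernel name "kern" are assumptions): precompile the PTX to a device-specific binary (cubin) with the ptxas tool, then load that image. cuModuleLoadData accepts a cubin image the same way it accepts a PTX string, but since a cubin is already machine code, loading it involves no JIT compilation.

```cuda
// Hedged sketch: skip the driver's JIT by precompiling PTX to a cubin.
//
// Offline step (once per program change), assuming a Fermi-class device:
//   ptxas --gpu-name=sm_20 -o kernel.cubin kernel.ptx
//
// Then load the binary image at runtime:
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

// Read a whole file into a malloc'd buffer; returns NULL on failure.
static void *readFile(const char *path, size_t *size) {
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    fseek(f, 0, SEEK_END);
    *size = (size_t)ftell(f);
    rewind(f);
    void *buf = malloc(*size);
    fread(buf, 1, *size, f);
    fclose(f);
    return buf;
}

int main(void) {
    CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction fn;
    size_t size;
    void *cubin = readFile("kernel.cubin", &size);  // assumed path
    if (!cubin) { fprintf(stderr, "missing kernel.cubin\n"); return 1; }

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    cuModuleLoadData(&mod, cubin);        // no JIT: cubin is already machine code
    cuModuleGetFunction(&fn, mod, "kern");  // assumed kernel name

    // ...launch with cuLaunchKernel as usual...

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    free(cubin);
    return 0;
}
```

The trade-off is that the cubin only runs on the architecture it was compiled for, whereas PTX is forward-compatible; given that your programs are generated at runtime anyway, invoking ptxas once per program change may well be cheaper than JIT-compiling the same PTX repeatedly.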