Well, I have quite a delicate question :)
Let's start with what I have:
- Data: a large array, already copied to the GPU
- A program, generated on the CPU (host), which needs to be evaluated for every element of that array
- The program changes very frequently; it can be generated as a CUDA C string, a PTX string, or something else (?) and needs to be re-evaluated after each change
What I want: Basically, I just want to make this as fast as possible, e.g. avoid compiling CUDA C to PTX at runtime. The solution can even be completely device-specific; no great compatibility is required here :)
What I know: I already know the function cuModuleLoad, which can load a module and create a kernel from PTX code stored in a file. But I think there must be some other way to create a kernel directly, without saving it to a file first. Or perhaps it may be possible to store it as bytecode?
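For reference, here is a minimal sketch of loading a module straight from a PTX string in memory with the driver API's cuModuleLoadData, so no file is involved. The PTX text and the kernel name "kern" are placeholders for whatever your host-side generator emits:

```cuda
// Hedged sketch: create a kernel from a PTX string held in memory,
// using the CUDA driver API. No intermediate file is needed.
#include <cuda.h>
#include <stdio.h>

static const char *ptxSource = "/* ...generated PTX text... */";  // placeholder

int main(void) {
    CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // cuModuleLoadData takes the image directly from memory.
    // The PTX is JIT-compiled for the current device at this point.
    CUresult r = cuModuleLoadData(&mod, ptxSource);
    if (r != CUDA_SUCCESS) { fprintf(stderr, "load failed: %d\n", r); return 1; }

    // "kern" is an assumed kernel name -- use your own entry point.
    cuModuleGetFunction(&fn, mod, "kern");

    // ...set up arguments and launch with cuLaunchKernel...

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```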
My question: How would you do that? Could you post an example, or a link to a website covering a similar topic? TY
Edit: OK, so a PTX kernel can be run directly from a PTX string (char array). Anyway, I still wonder: is there some better / faster solution? There is still a JIT compilation step from the PTX string to device code, which I would like to avoid if possible. I also suspect that some clever way of creating a device-specific CUDA binary from PTX might exist, which would remove the JIT compiler lag (it's small, but it can add up if you have huge numbers of kernels to run) :)
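One way to remove the JIT lag entirely, since you said a device-specific solution is fine (hedged sketch — file names, the sm_20 architecture, and the kernel name "kern" are assumptions): precompile the PTX to a device-specific binary (cubin) with the ptxas tool, then load that image. cuModuleLoadData accepts a cubin image the same way it accepts a PTX string, but since a cubin is already machine code, loading it involves no JIT compilation.

```cuda
// Hedged sketch: skip the driver's JIT by precompiling PTX to a cubin.
//
// Offline step (once per program change), assuming a Fermi-class device:
//   ptxas --gpu-name=sm_20 -o kernel.cubin kernel.ptx
//
// Then load the binary image at runtime:
#include <cuda.h>
#include <stdio.h>
#include <stdlib.h>

// Read a whole file into a malloc'd buffer; returns NULL on failure.
static void *readFile(const char *path, size_t *size) {
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    fseek(f, 0, SEEK_END);
    *size = (size_t)ftell(f);
    rewind(f);
    void *buf = malloc(*size);
    fread(buf, 1, *size, f);
    fclose(f);
    return buf;
}

int main(void) {
    CUdevice dev; CUcontext ctx; CUmodule mod; CUfunction fn;
    size_t size;
    void *cubin = readFile("kernel.cubin", &size);  // assumed path
    if (!cubin) { fprintf(stderr, "missing kernel.cubin\n"); return 1; }

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    cuModuleLoadData(&mod, cubin);        // no JIT: cubin is already machine code
    cuModuleGetFunction(&fn, mod, "kern");  // assumed kernel name

    // ...launch with cuLaunchKernel as usual...

    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    free(cubin);
    return 0;
}
```

The trade-off is that the cubin only runs on the architecture it was compiled for, whereas PTX is forward-compatible; given that your programs are generated at runtime anyway, invoking ptxas once per program change may well be cheaper than JIT-compiling the same PTX repeatedly.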