I'm running kernel benchmarks with OpenCL. I know that I can compile kernels offline with various tools from OpenCL vendors (e.g. ioc64 or poclcc). The problem is that I get performance results that I cannot explain from the assembly these tools produce, from the OpenCL runtime overhead, or from anything similar.
I would like to see the assembly of the kernels that my benchmark program compiles online and then executes. Is there a way to do that?
My approach is to extract this assembly from the cl::Program or cl::Kernel objects, but I haven't found a way to do that. I would appreciate any advice or solutions.