Is there a way to get detailed information about how an OpenCL kernel was compiled on NVIDIA platforms (or on other platforms)? I'm interested in either external tools or tests that can be put into the kernel. Specifically:
Did vectorization succeed, and how did the work items get grouped into warps?
If work items inside a work group take different branches, did the compiler optimize the code so that they still execute in parallel?
Did private memory variables get mapped to registers in the multiprocessor, or were they put into local/global memory? (Some architectures have more private memory per work group than local memory.)
Can this information be seen in the PTX assembly output, or is PTX still too high-level for that?
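To make the PTX part of the question concrete, here is a minimal sketch of the kind of inspection meant, assuming the PTX text for the kernel has already been obtained somehow. The `SAMPLE_PTX` fragment and the `summarize_ptx` helper are hypothetical and hand-written for illustration; they are not output from any real compiler. The sketch only counts textual hints: `.local` declarations and `ld.local`/`st.local` accesses (per-thread data placed outside registers) and predicated `bra` instructions (branches that may diverge). PTX uses virtual registers, so the `.reg` lines do not show the final hardware register allocation.

```python
import re

# Hypothetical PTX fragment, hand-written for illustration only.
SAMPLE_PTX = """
.entry demo_kernel(.param .u64 demo_kernel_param_0)
{
    .local .align 4 .b8 __local_depot0[64];
    .reg .pred %p<2>;
    .reg .f32 %f<4>;
    ld.local.f32 %f1, [%rd1];
    setp.lt.f32 %p1, %f1, %f2;
    @%p1 bra BB0_2;
    st.local.f32 [%rd1], %f3;
BB0_2:
    ret;
}
"""


def summarize_ptx(ptx):
    """Count a few textual hints in a PTX listing (hypothetical helper)."""
    lines = [line.strip() for line in ptx.splitlines()]
    return {
        # Declarations in the .local state space (per-thread, not registers).
        "local_decls": sum(1 for l in lines if l.startswith(".local")),
        # Loads and stores that touch the .local state space.
        "local_accesses": sum(
            1 for l in lines if re.search(r"\b(ld|st)\.local\b", l)
        ),
        # Branches guarded by a predicate register, e.g. "@%p1 bra LABEL;".
        "predicated_branches": sum(
            1 for l in lines if re.match(r"@!?%p\d+\s+bra\b", l)
        ),
        # Virtual register declarations; not the final hardware allocation.
        "reg_decls": sum(1 for l in lines if l.startswith(".reg")),
    }


if __name__ == "__main__":
    for key, value in summarize_ptx(SAMPLE_PTX).items():
        print(key, value)
```

A nonzero `local_decls` or `local_accesses` count would suggest that some private data did not stay in registers, but the reverse does not hold: spills introduced after PTX generation would not show up in this text.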