
I have a simple Metal compute kernel that I'm trying to make a CUDA equivalent of. The Metal kernel source is:

#include <metal_stdlib>
using namespace metal;

constant uint stride [[function_constant(0)]];
constant float dt    [[function_constant(1)]];
constant float a     [[function_constant(2)]];
constant float b     [[function_constant(3)]];

float2 f(const float2 x) {
    return float2(a, -b)*x.yx;
}

kernel void harmonic_occilator_stride(device float2 *x [[buffer(0)]],
                                             uint    i [[thread_position_in_grid]]) {
    for (uint j = 0; j < stride; j++) {
        x[i] += dt*f(x[i]);
    }
}

My first attempt at converting this to CUDA results in a bunch of errors when compiling to PTX:

__constant__ uint  stride;
__constant__ float dt;
__constant__ float a;
__constant__ float b;

__device__ float2 f(const float2 x) {
    return float2(a, -b)*x.yx;
}

extern "C" __global__ void harmonic_occilator_stride(float2 *x) {
    size_t i = blockIdx.x*blockDim.x + threadIdx.x;
    for (uint j = 0; j < stride; j++) {
        x[i] += dt*f(x[i]);
    }
}

The first thing it doesn't like is x.yx. In Metal, this reverses the order of the contents of the float2. How do I reverse or change the access order of a vector in CUDA?

The next thing it doesn't like is float2(a, -b). This gives the error "no suitable constructor to convert between float and float2". How do I construct a vector literal?

The last thing it complains about is that there is no * operator for float and float2 on the line dt*f(x[i]). If I remove the dt* and just write x[i] += f(x[i]), it complains there is no += operator for float2 and float2. How do I perform operations on these types, and can I multiply vectors by scalars?

In Metal, when I declare a constant as a function_constant, the Metal kernel compiler will JIT a specialized, optimized version of the kernel when the kernel function is loaded at runtime. Does CUDA have this functionality?

Although you didn't ask about it, if it were me I would not wrap my kernel definition in extern "C". If you want C-style linkability, I recommend creating ordinary (i.e. not decorated with __device__ or __global__) wrapper functions that call the kernel for you. Define those wrapper functions in the same module as the kernel definition, and call those from other modules that need C-linkability. – Robert Crovella
Also, your loop seems weird to me because the loop body does not depend on the loop variable j. But that is neither here nor there. – Robert Crovella
I need to use extern "C" so it doesn't mangle the name. I'm using the Driver API on the host side, and I want to be able to reference the kernel functions in a consistent way between CUDA, Metal, Vulkan, OpenCL, etc. The for loop is there to perform a bunch of time steps in a single kernel call instead of doing one kernel call per time step. Benchmarking in both my original Metal and CUDA implementations showed this produces a 10x speedup over the non-looped case. – user1139069

1 Answer


I've only just glanced at the Metal specification now, and I'm not going to try to fully address your last question. But I think the syntax questions can be answered conceptually by just dealing with the individual components, following the arithmetic rules defined by Metal.

it also doesn't like float2(a, -b). This gives an error "no suitable constructor to convert between float and float2". How do I construct a vector literal?

Use the functions defined in the header file vector_functions.h (or .hpp) for this purpose (see below for an example). The vector types defined for CUDA in vector_types.h don't have constructors.

The first thing it doesn't like is x.yx. In Metal, this reverses the order of the contents of the float2. How do I reverse or change the access order of a vector in Cuda?

CUDA doesn't have this kind of built-in vector element swizzling capability. Just perform the operation on the elements, using the element types.

metal:  return float2(a, -b)*x.yx;

CUDA:   #include <vector_functions.h>
        ...
        return make_float2(a*x.y, -b*x.x);

The last thing it complains about is that there is no * operator for float and float2 on the line dt*f(x[i]). If I remove the dt* and just write x[i] += f(x[i]), it complains there is no += operator for float2 and float2. How do I perform operations on these types, and can I multiply vectors by scalars?

Similar to above, you'll need to construct the equivalent arithmetic element-wise.

metal:  x[i] += dt*f(x[i]);

CUDA:   float2 temp1 = x[i];
        float2 temp2 = f(temp1);
        temp1.x += dt*temp2.x;
        temp1.y += dt*temp2.y;
        x[i] = temp1;
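Putting those element-wise rewrites together, a complete CUDA translation of your kernel might look like the following sketch. It assumes, as in your attempt, that the host fills in the __constant__ symbols (e.g. via cuModuleGetGlobal/cuMemcpyHtoD in the Driver API) before launch:

```cuda
#include <vector_functions.h>   // make_float2

__constant__ unsigned int stride;
__constant__ float dt;
__constant__ float a;
__constant__ float b;

// metal: return float2(a, -b)*x.yx;
__device__ float2 f(const float2 x) {
    return make_float2(a*x.y, -b*x.x);
}

extern "C" __global__ void harmonic_occilator_stride(float2 *x) {
    size_t i = blockIdx.x*blockDim.x + threadIdx.x;
    float2 xi = x[i];                    // each thread owns one element,
                                         // so keep it in a register
    for (unsigned int j = 0; j < stride; j++) {
        float2 dx = f(xi);
        // metal: x[i] += dt*f(x[i]);
        xi.x += dt*dx.x;
        xi.y += dt*dx.y;
    }
    x[i] = xi;                           // one global store at the end
}
```

Loading x[i] into a register once and storing it back after the loop is equivalent to your original per-iteration read/modify/write (each thread touches only its own element), and it avoids stride round trips to global memory.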

It should be possible to define a set of vector types of your own to match most of the capabilities of Metal, if you wanted to do that. What I'm describing here uses only what is "built in", but it could be a model if you wanted to create your own types with constructors, arithmetic operators, etc.
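As a sketch of that approach: since CUDA's float2 is a plain struct, you can add free-function operators so your original expressions compile almost unchanged (the swizzle still has to become a named helper; the function names here are just illustrative). The CUDA samples ship a header, helper_math.h, with a fuller set of overloads along these lines.

```cuda
#include <vector_functions.h>   // make_float2

// scalar * vector, as in dt*f(x[i])
__host__ __device__ inline float2 operator*(float s, float2 v) {
    return make_float2(s*v.x, s*v.y);
}

// vector += vector, as in x[i] += ...
__host__ __device__ inline float2& operator+=(float2 &u, float2 v) {
    u.x += v.x;
    u.y += v.y;
    return u;
}

// CUDA has no .yx swizzle, so spell it as a function
__host__ __device__ inline float2 yx(float2 v) {
    return make_float2(v.y, v.x);
}
```

With these in scope, the kernel body can stay as x[i] += dt*f(x[i]), and f can be written as a*yx-style element arithmetic.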

Regarding your last question, CUDA doesn't always JIT at runtime the way you are describing for Metal. Probably the closest thing to what you are describing might be something that uses C++ templating, which is supported by CUDA. In general, if you can convert Metal operations to equivalent C++ operations, you should be able to realize those directly in CUDA.
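That said, CUDA does offer runtime compilation through the NVRTC library, which can play a role similar to Metal's function constants: bake the constants into the source as macro definitions and JIT a specialized kernel, then load the resulting PTX with cuModuleLoadData. A minimal host-side sketch (error handling omitted; it assumes your kernel source string refers to STRIDE, DT, A, B macros instead of __constant__ symbols):

```cuda
#include <nvrtc.h>
#include <cstdio>
#include <cstdlib>

// JIT-compile a kernel with the "function constants" baked in as macros,
// so nvrtc can constant-fold and unroll, much like Metal's specialization.
char *compile_specialized(const char *src,
                          unsigned stride, float dt, float a, float b) {
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "harmonic.cu", 0, NULL, NULL);

    char o0[64], o1[64], o2[64], o3[64];
    snprintf(o0, sizeof o0, "-DSTRIDE=%uu", stride);
    snprintf(o1, sizeof o1, "-DDT=%ff", dt);
    snprintf(o2, sizeof o2, "-DA=%ff", a);
    snprintf(o3, sizeof o3, "-DB=%ff", b);
    const char *opts[] = { o0, o1, o2, o3 };

    nvrtcCompileProgram(prog, 4, opts);

    size_t ptx_size;
    nvrtcGetPTXSize(prog, &ptx_size);
    char *ptx = (char *)malloc(ptx_size);
    nvrtcGetPTX(prog, ptx);          // load with cuModuleLoadData
    nvrtcDestroyProgram(&prog);
    return ptx;
}
```

Since you are already on the Driver API, this slots in naturally: compile once per constant set at load time, cache the module, and launch the specialized kernel.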