3 votes

I have CUDA/C++ code that returns host-side C++ arrays. I wanted to manipulate these arrays in MATLAB, so I rewrote my code in MEX format and compiled it with mex.

I got it to work by passing preallocated arrays from MATLAB into the MEX function, but this slowed things down enormously (54 seconds vs. 14 seconds without MEX).

Here's the slow solution for a simplified version of my code with no inputs and one output:

#include "mex.h"
#include "gpu/mxGPUArray.h"
#include "matrix.h"
#include <stdio.h>
#include <stdlib.h>
#include "cuda.h"
#include "curand.h"
#include <cuda_runtime.h>
#include "math.h"
#include <curand_kernel.h>
#include <time.h>
#include <algorithm>
#include <iostream>

#define iterations 159744
#define transMatrixSize 2592 // Just for clarity. Do not change. No need to adjust this value for this simulation.
#define reps 1024 // Is equal to blocksize. Do not change without proper source code adjustments.
#define integralStep 13125  // Number of time steps to be averaged at the tail of the Force-Time curves to get Steady State Force

__global__ void kern(float *masterForces, ...)
{
    int globalIdx = ((blockIdx.x + (blockIdx.y * gridDim.x)) * (blockDim.x * blockDim.y))
                  + (threadIdx.x + (threadIdx.y * blockDim.x));
    ...
    {
        ...
        {
            ...
            {
                masterForces[i] = buffer[0] / 24576.0;
            }
        }
    }
    ...
}


void mexFunction(int nlhs, mxArray *plhs[],
                 int nrhs, mxArray const *prhs[])
{
    ...

    // Allocate the MATLAB output and get a host pointer to its data
    plhs[0] = mxCreateNumericMatrix(iterations, 1, mxSINGLE_CLASS, mxREAL);
    float *h_F0 = (float*) mxGetData(plhs[0]);

    // Device output vector
    float *d_F0;
    ...

    // Allocate memory for the output vector on the GPU
    cudaMalloc((void**)&d_F0, iterations * sizeof(float));
    ...

    ////////////////////////// LAUNCH //////////////////////////
    kern<<<1, 1024>>>(d_F0);

    ////////////////////////// RETRIEVE DATA //////////////////////////
    cudaMemcpyAsync(h_F0, d_F0, iterations * sizeof(float), cudaMemcpyDeviceToHost);

    ////////////////////////// FREE MEMORY //////////////////////////
    cudaDeviceReset();
}

Why so slow?

EDIT: MEX was compiling for an older architecture (sm_13) instead of sm_35. Now the timing makes sense (16 s with MEX vs. 14 s with C++/CUDA only).
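
For reference, one way to pin the architecture during compilation is to compile the .cu file with nvcc directly and link the resulting object file with mex. The sketch below assumes a typical Linux setup; the file name, CUDA library path, and the $MATLABROOT placeholder are illustrative, not taken from the actual build:

# Sketch: compile for sm_35 explicitly, then link the object file with mex
nvcc -c my_mex_kernel.cu -o my_mex_kernel.o -arch=sm_35 \
     -Xcompiler -fPIC -I"$MATLABROOT/extern/include"
mex my_mex_kernel.o -L/usr/local/cuda/lib64 -lcudart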

To which MathWorks example are you referring? - chappjc
standard cuda mex example "timestwo": mathworks.com/help/distcomp/… - Jordan
That example takes a gpuArray input and returns a gpuArray output. You want to take regular arrays in/out, right? - chappjc
See my UPDATED ANSWER. Also, remove delete h_F0; when using mxCreateNumericMatrix. - chappjc
Really nothing to clean up that I can see. Be sure not to time the first run. - chappjc

1 Answer

3 votes

There is no need to use an mxGPUArray if the outputs of your CUDA code are plain-old-data (POD) host-side (as opposed to device-side) arrays, like your Forces1 array of floats created with new. The MathWorks example you are referencing probably demonstrates MATLAB's gpuArray and built-in CUDA functionality, rather than how to pass data to and from regular CUDA functions within a MEX function.

If you can initialize Forces1 (or h_F0 in your full code) outside of and before the CUDA function (e.g. in the mexFunction), then the solution is simply to change from new to one of the mxCreate* functions (e.g. mxCreateNumericArray, mxCreateDoubleMatrix, mxCreateNumericMatrix, etc.) and then pass the data pointer to your CUDA function:

plhs[0] = mxCreateNumericMatrix(iterations,1,mxSINGLE_CLASS,mxREAL);
float *h_F0 = (float*) mxGetData(plhs[0]);
// myCudaWrapper(..., h_F0, ...)   // i.e. cudaMemcpyAsync(h_F0, d_F0, ...)

The only changes to your code are thus:

Replace:

float *h_F0 = new float[(iterations)];

with

plhs[0] = mxCreateNumericMatrix(iterations,1,mxSINGLE_CLASS,mxREAL);
float *h_F0 = (float*) mxGetData(plhs[0]);

Remove:

delete h_F0;
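
For context, a minimal self-contained MEX file using this pattern could look like the sketch below. The kernel (scaleKernel), sizes, and launch configuration are made up purely for illustration; only the allocate-launch-copy pattern mirrors the changes above:

#include "mex.h"
#include <cuda_runtime.h>

#define N 159744

// Hypothetical kernel: fills the output with scaled global thread indices.
__global__ void scaleKernel(float *out, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        out[i] = i * 0.5f;
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, mxArray const *prhs[])
{
    // MATLAB owns this buffer; it becomes the function's output.
    plhs[0] = mxCreateNumericMatrix(N, 1, mxSINGLE_CLASS, mxREAL);
    float *h_out = (float*) mxGetData(plhs[0]);

    float *d_out;
    cudaMalloc((void**)&d_out, N * sizeof(float));

    scaleKernel<<<64, 256>>>(d_out, N);

    // Synchronous copy straight into the MATLAB-owned buffer.
    cudaMemcpy(h_out, d_out, N * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_out);
}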

Note: If instead your CUDA code owned the output host-side array, then you would have to copy the data into an mxArray. This is because, unless you allocate the mexFunction outputs with the mx API, any data buffer you assign (e.g. with mxSetData) will not be handled by the MATLAB memory manager, and you will get a segfault or, at best, a memory leak.
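
If you do end up in that situation, the fix is a plain copy into a MATLAB-allocated array before the MEX function returns. A sketch, where the helper name and the h_forces buffer are hypothetical:

#include "mex.h"
#include <cstring>

// Sketch: copy a CUDA-owned host buffer into a fresh MATLAB array.
// 'src' and 'n' stand in for whatever the CUDA code produced.
static mxArray* copyToMxArray(const float *src, mwSize n)
{
    mxArray *out = mxCreateNumericMatrix(n, 1, mxSINGLE_CLASS, mxREAL);
    std::memcpy(mxGetData(out), src, n * sizeof(float));
    return out;
}

// Usage inside mexFunction:
//   plhs[0] = copyToMxArray(h_forces, iterations);
//   delete[] h_forces;   // the host copy owned by the CUDA code can now be freed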