
I'm trying to run a cublas function from within a kernel in the following way:

__device__ void doLinear(const float *W, const float *input, unsigned i, float *out, unsigned o) {
    unsigned idx = blockIdx.x*blockDim.x + threadIdx.x;

    const float alpha = 1.0f;
    const float beta = 0.0f;

    if(idx == 0) {
        cublasHandle_t cnpHandle;
        cublasStatus_t status = cublasCreate(&cnpHandle);
        if(status == CUBLAS_STATUS_SUCCESS) {
            // W is o x i in column-major order, so its leading dimension is o
            cublasSgemv(cnpHandle, CUBLAS_OP_N, o, i, &alpha, W, o, input, 1, &beta, out, 1);
            cublasDestroy(cnpHandle);
        }
    }
    __syncthreads();
}

This function works perfectly well if the input pointer is allocated using cudaMalloc.

My issue is that if the input pointer instead points to shared memory containing data generated within the kernel, I get the error CUDA_EXCEPTION_14 (Warp Illegal Address).

Is it not possible to pass a pointer to shared memory to a CUBLAS function called from a kernel?

What is the correct way to allocate my memory here? (At the moment I'm just doing another cudaMalloc and using that buffer as my 'shared' memory, but it's making me feel a bit dirty.)
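Roughly, the workaround looks like this (the kernel and variable names are illustrative, and `doLinear` is as defined above):

```cuda
// Device side: each block uses its own slice of a global scratch buffer
// instead of __shared__ memory.
__global__ void myKernel(const float *W, float *scratch, unsigned i,
                         float *out, unsigned o) {
    float *input = scratch + (size_t)blockIdx.x * i; // global memory, so safe to pass on
    // ... generate the i input values into input[0..i) ...
    doLinear(W, input, i, out, o);
}

// Host side: allocate one slice of i floats per block up front
// float *scratch;
// cudaMalloc(&scratch, (size_t)numBlocks * i * sizeof(float));
// myKernel<<<numBlocks, threadsPerBlock>>>(W, scratch, i, d_out, o);
```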


1 Answer


You can't pass shared memory to a CUBLAS device API routine because it violates the CUDA dynamic parallelism memory model on which device side CUBLAS is based. The best you can do is use malloc() or new to allocate thread local memory on the runtime heap for the CUBLAS routine to use, or a portion of an a priori allocated buffer allocated with one of the host side APIs (as you are presently doing).