
JCuda + GeForce GT 640 Question:

I'm trying to reduce the latency of copying memory from device to host after the GPU has computed a result. Working through the simple vector-add program, I found that the bulk of the latency is indeed in copying the result buffer back to the host side: transferring the source buffers to the device is negligible (~0.30 ms), while copying the result back is on the order of 20 ms.

I did some research and found that a better alternative to copying out the results is to use pinned memory. From what I learned, this memory is allocated on the host side, but the kernel has direct access to it over PCIe, which should be faster than copying the result back in bulk after the computation. I'm using the following example, but the results aren't what I expect.

Kernel (a simple example to illustrate the point; launched with only 1 block of 1 thread):

extern "C"
__global__ void add(int* test)
{
    test[0]=1; test[1]=2; test[2]=3; test[3]=4; test[4]=5;
}

Java:

import java.io.*;
import jcuda.*;
import jcuda.runtime.*;
import jcuda.driver.*;

import static jcuda.runtime.cudaMemcpyKind.*;
import static jcuda.driver.JCudaDriver.*;

public class JCudaTest
{
    public static void main(String args[])
    {
        // Initialize the driver and create a context for the first device.
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);

        // Load the ptx file.
        CUmodule module = new CUmodule();
        JCudaDriver.cuModuleLoad(module, "JCudaKernel.ptx");

        // Obtain a function pointer to the kernel function.
        CUfunction function = new CUfunction();
        JCudaDriver.cuModuleGetFunction(function, module, "add");

        Pointer P = new Pointer();
        JCudaDriver.cuMemAllocHost(P, 5*Sizeof.INT);

        Pointer kernelParameters = Pointer.to(P);
        // Call the kernel function with 1 block, 1 thread:
        JCudaDriver.cuLaunchKernel(function, 1, 1, 1, 1, 1, 1, 0, null, kernelParameters, null);
        int[] T = new int[5];
        JCuda.cudaMemcpy(Pointer.to(T), P, 5*Sizeof.INT, cudaMemcpyHostToHost);

        // Print the results:
        for(int i=0; i<5; i++)
            System.out.println(T[i]);
    }
}

1.) Build the kernel:

root@NVS295-CUDA:~/JCUDA/MySamples# nvcc -ptx JCudaKernel.cu
root@NVS295-CUDA:~/JCUDA/MySamples# ls -lrt | grep ptx
-rw-r--r-- 1 root root 3295 Mar 27 17:46 JCudaKernel.ptx

2.) Build the Java:

root@NVS295-CUDA:~/JCUDA/MySamples# javac -cp "../JCuda-All-0.5.0-bin-linux-x86/*:." JCudaTest.java

3.) Run the code:

root@NVS295-CUDA:~/JCUDA/MySamples# java -cp "../JCuda-All-0.5.0-bin-linux-x86/*:." JCudaTest
0
0
0
0
0

Expecting: 1 2 3 4 5

Note: I'm using JCuda 0.5.0 for x86, if that matters.

Please let me know what I'm doing wrong. Thanks in advance, Ilir

Comments:

You haven't actually asked a question here... – talonmies
What am I doing wrong? – Ilir Iljazi

1 Answer


The problem here is that the device cannot, in general, access host memory directly.

Admittedly, the documentation sounds misleading here:

cuMemAllocHost

Allocates bytesize bytes of host memory that is page-locked and accessible to the device...

This sounds like a clear statement. However, "accessible" here does not mean that the memory may be used directly as a kernel parameter in all cases. This is only possible on devices that support Unified Addressing. For all other devices, it is necessary to obtain a device pointer that corresponds to the allocated host pointer, with cuMemHostGetDevicePointer.
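Applied to the code from the question, the fix could look like the following minimal, untested sketch (the class name is made up). It assumes a device without unified addressing: the context is created with the CU_CTX_MAP_HOST flag, the memory is allocated with cuMemHostAlloc and the CU_MEMHOSTALLOC_DEVICEMAP flag (so that a device pointer can be obtained at all), and a cuCtxSynchronize call is added so that the host does not read the results before the kernel has finished:

import jcuda.*;
import jcuda.runtime.*;
import jcuda.driver.*;

import static jcuda.runtime.cudaMemcpyKind.*;
import static jcuda.driver.JCudaDriver.*;

public class JCudaMappedTest
{
    public static void main(String args[])
    {
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);

        // Create the context with host memory mapping enabled
        CUcontext context = new CUcontext();
        cuCtxCreate(context, CUctx_flags.CU_CTX_MAP_HOST, device);

        CUmodule module = new CUmodule();
        cuModuleLoad(module, "JCudaKernel.ptx");
        CUfunction function = new CUfunction();
        cuModuleGetFunction(function, module, "add");

        // Allocate page-locked host memory that is mapped into the
        // device address space
        Pointer hostPointer = new Pointer();
        cuMemHostAlloc(hostPointer, 5*Sizeof.INT, CU_MEMHOSTALLOC_DEVICEMAP);

        // Obtain the device pointer that corresponds to the host allocation
        CUdeviceptr devicePointer = new CUdeviceptr();
        cuMemHostGetDevicePointer(devicePointer, hostPointer, 0);

        // Pass the DEVICE pointer (not the host pointer) to the kernel
        Pointer kernelParameters = Pointer.to(Pointer.to(devicePointer));
        cuLaunchKernel(function, 1, 1, 1, 1, 1, 1, 0, null, kernelParameters, null);

        // The launch is asynchronous: wait for the kernel to finish
        // before reading the results on the host
        cuCtxSynchronize();

        // Read the results from the page-locked host memory (a plain
        // host-to-host copy, as in the original code)
        int[] T = new int[5];
        JCuda.cudaMemcpy(Pointer.to(T), hostPointer, 5*Sizeof.INT, cudaMemcpyHostToHost);
        for(int i=0; i<5; i++)
            System.out.println(T[i]);
    }
}

Note that the kernel parameters are wrapped twice: cuLaunchKernel expects a pointer to an array of pointers to the actual argument values, so for a single int* argument this is Pointer.to(Pointer.to(devicePointer)).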

The key point of page-locked host memory is that the data transfer between the host and device is faster. An example of how this memory may be used in JCuda can be seen in the JCudaBandwidthTest sample (this is for the runtime API, but for the driver API, it works analogously).
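This is not the JCudaBandwidthTest sample itself, just a rough sketch of the pattern it measures (class name and buffer size are made up): the page-locked buffer simply replaces a pageable Java array as the host-side endpoint of an ordinary bulk copy, which the driver can then perform via DMA:

import jcuda.*;
import static jcuda.runtime.JCuda.*;
import static jcuda.runtime.cudaMemcpyKind.*;

public class PinnedCopySketch
{
    public static void main(String args[])
    {
        int n = 1 << 20;

        // Page-locked (pinned) host buffer instead of a pageable int[]
        Pointer hostData = new Pointer();
        cudaMallocHost(hostData, n*Sizeof.INT);

        // Ordinary device buffer
        Pointer deviceData = new Pointer();
        cudaMalloc(deviceData, n*Sizeof.INT);

        // Copies between a pinned buffer and the device can be done via
        // DMA and are typically faster than copies involving pageable
        // host memory
        cudaMemcpy(deviceData, hostData, n*Sizeof.INT, cudaMemcpyHostToDevice);
        cudaMemcpy(hostData, deviceData, n*Sizeof.INT, cudaMemcpyDeviceToHost);

        cudaFreeHost(hostData);
        cudaFree(deviceData);
    }
}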

EDIT:

Note that the new Unified Memory feature of CUDA 6 actually supports what you originally intended to do: with cudaMallocManaged you can allocate memory that is directly accessible to both the host and the device, in the sense that it can, for example, be passed to a kernel, written by the device, and afterwards read by the host without additional effort. Unfortunately, this concept does not map very well to Java, because the memory is still managed by CUDA: this memory cannot replace the memory that the Java VM uses for, say, a float[] array. But at least it should be possible to create a ByteBuffer from the memory that was allocated with cudaMallocManaged, so that you can access it, for example, as a FloatBuffer.
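A minimal sketch of that idea (assuming JCuda 0.6.0 or later, since the 0.5.0 version from the question predates CUDA 6, and assuming that Pointer#getByteBuffer may be applied to the managed pointer, which is exactly the "should be possible" part above; the class name is made up):

import java.nio.*;
import jcuda.*;
import static jcuda.runtime.JCuda.*;

public class ManagedMemorySketch
{
    public static void main(String args[])
    {
        int n = 5;

        // Managed memory, accessible to both the host and the device
        Pointer managed = new Pointer();
        cudaMallocManaged(managed, n*Sizeof.FLOAT, cudaMemAttachGlobal);

        // ... launch a kernel that writes into the managed memory ...

        // The host must synchronize before touching the memory
        cudaDeviceSynchronize();

        // View the same memory from Java, without any explicit copy
        FloatBuffer buffer = managed.getByteBuffer(0, n*Sizeof.FLOAT)
            .order(ByteOrder.nativeOrder()).asFloatBuffer();
        System.out.println(buffer.get(0));

        cudaFree(managed);
    }
}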