JCuda + GeForce GT 640 question:
I'm trying to reduce the latency of copying memory from device to host after the GPU has computed a result. With a simple vector-add program, I found that the bulk of the latency is indeed in copying the result buffer back to the host side. Transferring the source buffers to the device is negligible (~0.30 ms), while copying the result back takes on the order of 20 ms.
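For anyone reproducing these numbers, here is a minimal sketch of how per-phase timings like the ones above could be taken in plain Java. The `PhaseTimer` class and its `timeMs` helper are hypothetical names introduced for illustration, not part of JCuda, and the sleep call only stands in for a real phase such as a host-to-device or device-to-host copy. One caveat: kernel launches are asynchronous, so unless the context is synchronized before the timer starts, a blocking copy that follows the launch also absorbs the time spent waiting for the kernel to finish.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper for illustration; not part of JCuda.
final class PhaseTimer {

    // Runs the given phase once and returns its wall-clock duration in milliseconds.
    static double timeMs(Runnable phase) {
        long start = System.nanoTime();
        phase.run();
        long elapsedNanos = System.nanoTime() - start;
        return elapsedNanos / 1_000_000.0;
    }

    public static void main(String[] args) {
        // Stand-in for a real phase (e.g. a blocking device-to-host copy).
        double ms = timeMs(() -> {
            try {
                TimeUnit.MILLISECONDS.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.printf("phase took %.2f ms%n", ms);
    }
}
```

With JCuda, the `Runnable` would wrap the copy call, with a `cuCtxSynchronize()` issued before the timer starts so that kernel time is not counted as copy time.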
I did some research and found that a better alternative to copying out the results is to use pinned memory. From what I learned, this memory is allocated on the host side, but the kernel has direct access to it over PCIe, which should in turn be faster than copying the result in bulk after the computation. I'm using the following example, but the results aren't what I expect.
Kernel (a simple example to illustrate the point, launched with 1 block and 1 thread only):
extern "C"
__global__ void add(int* test)
{
    test[0] = 1;
    test[1] = 2;
    test[2] = 3;
    test[3] = 4;
    test[4] = 5;
}
Java:
import java.io.*;
import jcuda.*;
import jcuda.runtime.*;
import jcuda.driver.*;
import static jcuda.runtime.cudaMemcpyKind.*;
import static jcuda.driver.JCudaDriver.*;
public class JCudaTest
{
    public static void main(String args[])
    {
        // Initialize the driver and create a context for the first device.
        cuInit(0);
        CUdevice device = new CUdevice();
        cuDeviceGet(device, 0);
        CUcontext context = new CUcontext();
        cuCtxCreate(context, 0, device);
        // Load the ptx file.
        CUmodule module = new CUmodule();
        JCudaDriver.cuModuleLoad(module, "JCudaKernel.ptx");
        // Obtain a function pointer to the kernel function.
        CUfunction function = new CUfunction();
        JCudaDriver.cuModuleGetFunction(function, module, "add");
        // Allocate page-locked (pinned) host memory for 5 ints.
        Pointer P = new Pointer();
        JCudaDriver.cuMemAllocHost(P, 5*Sizeof.INT);
        Pointer kernelParameters = Pointer.to(P);
        // Call the kernel function with 1 block, 1 thread:
        JCudaDriver.cuLaunchKernel(function, 1, 1, 1, 1, 1, 1, 0, null, kernelParameters, null);
        // Copy the pinned buffer into a Java array.
        int [] T = new int[5];
        JCuda.cudaMemcpy(Pointer.to(T), P, 5*Sizeof.INT, cudaMemcpyHostToHost);
        // Print the results:
        for(int i=0; i<5; i++)
            System.out.println(T[i]);
    }
}
1.) Build the kernel:
root@NVS295-CUDA:~/JCUDA/MySamples# nvcc -ptx JCudaKernel.cu
root@NVS295-CUDA:~/JCUDA/MySamples# ls -lrt | grep ptx
-rw-r--r-- 1 root root 3295 Mar 27 17:46 JCudaKernel.ptx
2.) Build the Java:
root@NVS295-CUDA:~/JCUDA/MySamples# javac -cp "../JCuda-All-0.5.0-bin-linux-x86/*:." JCudaTest.java
3.) Run the code:
root@NVS295-CUDA:~/JCUDA/MySamples# java -cp "../JCuda-All-0.5.0-bin-linux-x86/*:." JCudaTest
0
0
0
0
0
Expecting: 1 2 3 4 5
Note: I'm using JCuda 0.5.0 for x86, if that matters.
Please let me know what I'm doing wrong. Thanks in advance, Ilir