4
votes

I'm running Windows 7 64-bit, CUDA 4.2, Visual Studio 2010.

First, I run some code on the GPU with CUDA, then copy the data back to the host, do some processing, and move it back to the device. Then I do the following copy from device to host; it runs very fast, about 1 ms.

clock_t start, end;
int count = 1000000;
thrust::host_vector<int> h_a(count);
thrust::device_vector<int> d_b(count, 0);
int *d_bPtr = thrust::raw_pointer_cast(&d_b[0]);
start = clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end = clock();
cout << "Time Spent:" << end - start << endl;

It takes ~1ms to finish.

Then I run some other code with CUDA, mainly atomic operations, and copy the data from device to host again. This time it takes a very long time, about 9 s.

__global__ void dosomething(int *d_bPtr)
{
    ....
    atomicExch(d_bPtr, c);
    ....
}

start = clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end = clock();
cout << "Time Spent:" << end - start << endl;

It takes ~9s to finish.

I ran the code multiple times, for example:

int i = 0;
while (i < 10)
{
    clock_t start, end;
    int count = 1000000;
    thrust::host_vector<int> h_a(count);
    thrust::device_vector<int> d_b(count, 0);
    int *d_bPtr = thrust::raw_pointer_cast(&d_b[0]);

    start = clock();
    thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
    end = clock();
    cout << "Time Spent:" << end - start << endl;

    // launch the kernel with the atomic operations (dosomething, as above);
    // grid/block configuration as in the real code
    dosomething<<<blocks, threads>>>(d_bPtr);

    start = clock();
    thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
    end = clock();
    cout << "Time Spent:" << end - start << endl;

    i++;
}

The results are pretty much the same.
What could be the problem?

Thank you!

2
I'm curious: are you calling atomicExch from host-side code or in a device function? atomicExch can only be called in a device function/kernel. - Hong Zhou
I still don't understand how you are able to use thrust::raw_ptr_cast with the device_vector's first index. I tried to run a snippet from your code and I am getting the error: argument list for class template "thrust::device_ptr" is missing... - Recker
I'm sorry, my bad. It should be int *device_ptr = thrust::raw_pointer_cast(&d_b[0]); I will update it. Do you think this is what is causing the problem? Or should I directly use d_b.begin() as the input for the atomic operation? Thank you! - UserKiwi
Could you post the shortest reproducer you can come up with? I tried making a simple example from your code but didn't see anything wrong. There are various odd syntax errors in your code, so it would help to have a compilable reproducer. - Robert Crovella
I'm really sorry, my bad. I don't have my source code with me. I will follow talonmies's suggestions and re-run the test, and post compilable code as soon as possible tomorrow. Thank you very much!! - UserKiwi

2 Answers

10
votes

The problem is one of timing, not any change in copy performance. Kernel launches are asynchronous in CUDA, so what you are measuring is not just the time for thrust::copy but also the time for the prior kernel you launched to complete. If you change your code for timing the copy operation to something like this:

cudaDeviceSynchronize(); // wait until the prior kernel has finished
start = clock();
thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
end = clock();
cout << "Time Spent:" << end - start << endl;

you should find the transfer times are restored to their previous performance. So your real question isn't "why is thrust::copy slow?", it is "why is my kernel slow?". And based on the rather terrible pseudocode you posted, the answer is "because it is full of atomicExch() calls, which serialise kernel memory transactions".
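
For completeness, a minimal self-contained version of that timing pattern might look like the sketch below. The dummy_atomic_kernel body and the <<<256, 256>>> launch configuration are placeholders, not the code from the question; the point is only where the synchronisation goes relative to the timer.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <ctime>

__global__ void dummy_atomic_kernel(int *d_ptr)
{
    // Stand-in for the atomic-heavy kernel in the question: every thread
    // hammers the same address, so the accesses are serialised.
    for (int i = 0; i < 1000; ++i)
        atomicExch(d_ptr, (int)threadIdx.x);
}

int main()
{
    const int count = 1000000;
    thrust::host_vector<int>   h_a(count);
    thrust::device_vector<int> d_b(count, 0);
    int *d_bPtr = thrust::raw_pointer_cast(&d_b[0]);

    dummy_atomic_kernel<<<256, 256>>>(d_bPtr);

    // Without this synchronisation, the clock() timing below also includes
    // the still-running kernel, which is what makes the copy look slow.
    cudaDeviceSynchronize();

    clock_t start = clock();
    thrust::copy(d_b.begin(), d_b.end(), h_a.begin());
    clock_t end = clock();
    printf("Copy time: %ld clock ticks\n", (long)(end - start));
    return 0;
}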

-1
votes

I suggest you use CUDPP; in my opinion it is faster than Thrust (I'm writing a master's thesis about optimization and I have tried both libraries). If the copy is very slow, you can try to write your own kernel to copy the data.
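
For illustration, a hand-written copy kernel could look like the sketch below. The name copy_kernel and the launch configuration are illustrative, and note that this only covers device-to-device copies; a device-to-host transfer still goes through cudaMemcpy or thrust::copy.

// Sketch of a hand-written element-wise copy kernel (device-to-device only).
__global__ void copy_kernel(const int *src, int *dst, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        dst[idx] = src[idx];
}

// Example launch for n elements:
// int threads = 256;
// int blocks = (n + threads - 1) / threads;
// copy_kernel<<<blocks, threads>>>(d_src, d_dst, n);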