It seems that device_vector.swap() avoids memory movements. Indeed, consider the following code:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <thrust/device_vector.h>

// Print a device vector element by element.
// Each d_a[k] access reads a single element back from the device.
void printDeviceVector(thrust::device_vector<int> &d_a) {
    for (size_t k = 0; k < d_a.size(); k++) {
        int temp = d_a[k];
        printf("%i\n", temp);
    }
}

int main()
{
    const int N = 10;

    thrust::device_vector<int> d_a(N, 1);
    thrust::device_vector<int> d_b(N, 2);

    printf("Original device vector d_a\n");
    printDeviceVector(d_a);
    printf("Original device vector d_b\n");
    printDeviceVector(d_b);

    // Operation under test: swap the two device vectors.
    d_b.swap(d_a);

    printf("Final device vector d_a\n");
    printDeviceVector(d_a);
    printf("Final device vector d_b\n");
    printDeviceVector(d_b);

    // Release the device memory before resetting the device.
    d_a.clear();
    thrust::device_vector<int>().swap(d_a);
    d_b.clear();
    thrust::device_vector<int>().swap(d_b);

    cudaDeviceReset();
    return 0;
}
The code above uses d_b.swap(d_a). If we profile it, we see no device-to-device memory movement in the timeline.

If, on the other hand, we change d_b.swap(d_a) to d_b = d_a, then device-to-device memory movements appear in the timeline.

Finally, the timing is significantly in favor of d_b.swap(d_a) rather than d_b = d_a. For N = 33554432, the timings are:
d_b.swap(d_a)    0.001152 ms
d_b = d_a        3.181824 ms
thrust::vector class has a member of type contiguous_storage which is used for storing the vector contents. When vectors are swapped, internally, only the begin() iterator, size and allocator of contiguous_storage are swapped. So there is no memory copy of data involved. You can check this in the swap member function inside the file contiguous_storage.inl. – sgarizvi

[...] vector_base::operator=, it uses the assign function which seems to perform complete memory copy of the vector contents. – sgarizvi

[...] thrust uses a kernel to perform the copy? – Vitality

[...] swap in my main code, with benefits on timing. At least, it seems that swap is faster than copy. – Vitality
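
The difference described in the comments can be sketched in plain host C++. This is only a model of the idea, not Thrust's code: the Storage struct, its data and size members, and the helper swap_only_exchanges_handles are hypothetical names invented for illustration, and the allocator that contiguous_storage also swaps is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical stand-in for a vector's storage. This is NOT Thrust's
// contiguous_storage, only a host-side model of the same idea.
struct Storage {
    int*        data;  // start of the buffer (plays the role of begin())
    std::size_t size;  // number of elements

    Storage(std::size_t n, int value) : data(new int[n]), size(n) {
        for (std::size_t i = 0; i < n; ++i) data[i] = value;
    }
    ~Storage() { delete[] data; }

    // Implicit copying is disabled so every element copy goes through assign().
    Storage(const Storage&) = delete;
    Storage& operator=(const Storage&) = delete;

    // Swap exchanges the two handles only: constant work, no element is touched.
    void swap(Storage& other) {
        std::swap(data, other.data);
        std::swap(size, other.size);
    }

    // Assignment-style copy: every element is copied, so work grows with size.
    // Returns the number of elements copied.
    std::size_t assign(const Storage& other) {
        if (this == &other) return 0;
        if (size != other.size) {
            delete[] data;
            data = new int[other.size];
            size = other.size;
        }
        for (std::size_t i = 0; i < size; ++i) data[i] = other.data[i];
        return size;
    }
};

// Returns true if swapping left both buffers where they were and only
// exchanged which object owns which buffer.
bool swap_only_exchanges_handles() {
    Storage a(4, 1), b(4, 2);
    int* a_buf = a.data;
    int* b_buf = b.data;
    b.swap(a);
    assert(a.size == 4 && b.size == 4);
    return a.data == b_buf && b.data == a_buf &&
           a.data[0] == 2 && b.data[0] == 1;
}
```

In this model, swap does a fixed amount of work whatever the vector size, whereas assign touches every element. That matches the profiler observation and the timings above, under the assumption that the real classes behave as the comments describe.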