Swapping CUDA Thrust device vectors without memory movements

Question

If I have two cudaMalloced arrays, I can swap them without memory movements by simply swapping the related pointers.

If I have two CUDA Thrust device_vectors, say d_a and d_b, I can swap them by using a third temporary vector, say d_c, but this will require memory movements.

My question is: is there a way to swap CUDA Thrust device_vectors without memory movements?

The thrust::vector class has a member of type contiguous_storage which is used for storing the vector contents. When vectors are swapped, internally, only the begin() iterator, size and allocator of contiguous_storage are swapped. So there is no memory copy of data involved. You can check this in the swap member function inside the file contiguous_storage.inl. — sgarizvi
In the case of the assignment operator, if you look at the code of vector_base::operator=, it uses the assign function, which seems to perform a complete memory copy of the vector contents. — sgarizvi
@sgarizvi Thanks for your comments. Actually, this is the same objection @talonmies pointed out in his comments below. However, the strange thing is that I cannot find memory copies in the timeline. Perhaps thrust uses a kernel to perform the copy? — Vitality
@sgarizvi I have replaced the explicit copies with swap in my main code, with benefits on timing. At least, it seems that swap is faster than copy. — Vitality
I think this is a relatively new feature of the vector classes. In the old days I'm pretty sure swap used copy assignment and that triggered memory copies. — talonmies

Vitality · Accepted Answer · 2017-10-05T16:44:56

It seems that device_vector.swap() avoids memory movements.

Indeed, consider the following code:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

#include <thrust/device_vector.h>

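// Prints a device vector element by element. Each d_a[k] read copies a
// single element from device to host, so use this on small vectors only.
void printDeviceVector(const thrust::device_vector<int> &d_a) {

    for (size_t k = 0; k < d_a.size(); k++) {

        int temp = d_a[k];
        printf("%i\n", temp);

    }

}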

int main()
{
    const int N = 10;

    thrust::device_vector<int> d_a(N, 1);
    thrust::device_vector<int> d_b(N, 2);

    // --- Original
    printf("Original device vector d_a\n");
    printDeviceVector(d_a);
    printf("Original device vector d_b\n");
    printDeviceVector(d_b);

    d_b.swap(d_a);

    // --- After the swap
    printf("Final device vector d_a\n");
    printDeviceVector(d_a);
    printf("Final device vector d_b\n");
    printDeviceVector(d_b);

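    // Release the device storage before cudaDeviceReset(): swapping with an
    // empty temporary hands the buffer to the temporary, which frees it.
    d_a.clear();
    thrust::device_vector<int>().swap(d_a);
    d_b.clear();
    thrust::device_vector<int>().swap(d_b);

    cudaDeviceReset();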

    return 0;
}

With

    d_b.swap(d_a);

the profiler timeline shows no device-to-device memory movement.

If, on the other hand, we change d_b.swap(d_a) to

    d_b = d_a;

then device-to-device copies appear in the timeline.
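
The same conclusion can be cross-checked without a profiler by comparing the raw device addresses of the two buffers before and after the swap. The following is only a minimal sketch: the helper name swapExchangesBuffers is made up for illustration, and it assumes a Thrust version in which swap() exchanges the storage handles, as described in the comments above. If swap() copied the elements instead, each vector would keep its original address and the check would fail.

```cuda
#include <cstdio>
#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>

// Hypothetical helper (not part of Thrust): returns true if swap() exchanged
// the underlying device buffers, i.e. each vector now owns the allocation
// that the other one owned before the call.
bool swapExchangesBuffers(thrust::device_vector<int> &d_a,
                          thrust::device_vector<int> &d_b)
{
    // Raw device addresses of the two buffers before the swap.
    const int *a_before = thrust::raw_pointer_cast(d_a.data());
    const int *b_before = thrust::raw_pointer_cast(d_b.data());

    d_b.swap(d_a);

    // Raw device addresses after the swap.
    const int *a_after = thrust::raw_pointer_cast(d_a.data());
    const int *b_after = thrust::raw_pointer_cast(d_b.data());

    return a_after == b_before && b_after == a_before;
}

int main()
{
    const int N = 10;

    thrust::device_vector<int> d_a(N, 1);
    thrust::device_vector<int> d_b(N, 2);

    printf("buffers exchanged: %s\n",
           swapExchangesBuffers(d_a, d_b) ? "yes" : "no");

    return 0;
}
```

If the addresses are exchanged, no element had to be moved: only the bookkeeping of the two vectors changed.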

Finally, d_b.swap(d_a) is far faster than d_b = d_a. For N = 33554432, the timings are:

d_b.swap(d_a)     0.001152 ms
d_b = d_a         3.181824 ms
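
For anyone who wants to repeat the comparison, here is one possible way to time the two operations with CUDA events. This is a sketch under assumptions: the measurement method used for the numbers above is not stated, the helper elapsedMs is a made-up name, and the absolute values will depend on the GPU and the Thrust version.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>

// Hypothetical helper: runs op() between two CUDA events recorded on the
// default stream and returns the elapsed time in milliseconds.
template <typename F>
float elapsedMs(F op)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    op();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until all work before 'stop' is done

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const size_t N = 33554432;

    // Constructing the vectors first also initializes the CUDA context,
    // so that cost is not charged to the first measurement.
    thrust::device_vector<int> d_a(N, 1);
    thrust::device_vector<int> d_b(N, 2);

    float swapMs = elapsedMs([&] { d_b.swap(d_a); });
    float copyMs = elapsedMs([&] { d_b = d_a; });

    printf("d_b.swap(d_a)     %f ms\n", swapMs);
    printf("d_b = d_a         %f ms\n", copyMs);

    return 0;
}
```

The swap time should stay roughly constant as N grows, while the assignment time should scale with N, since only the assignment touches the elements.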
