3
votes

If I have two cudaMalloced arrays, I can swap them without memory movements by simply swapping the related pointers.
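
To make the raw-pointer case concrete, here is a minimal sketch. The helper name swapRawDevicePointers and the buffer contents are illustrative assumptions, and error checking is omitted for brevity. Only the host-side pointer values are exchanged; the device buffers themselves are never copied.

```cuda
#include <cstdio>
#include <utility>
#include <cuda_runtime.h>

// Hypothetical helper: fills two device buffers, swaps only the host-side
// pointers, and reports whether d_a now refers to the data that was in d_b.
bool swapRawDevicePointers() {
    const int N = 4;
    const size_t bytes = N * sizeof(int);
    int h_ones[N] = {1, 1, 1, 1};
    int h_twos[N] = {2, 2, 2, 2};

    int *d_a = nullptr;
    int *d_b = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemcpy(d_a, h_ones, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_twos, bytes, cudaMemcpyHostToDevice);

    // The swap itself: two pointer values change on the host,
    // no device memory is touched.
    std::swap(d_a, d_b);

    int h_out[N] = {0, 0, 0, 0};
    cudaMemcpy(h_out, d_a, bytes, cudaMemcpyDeviceToHost);
    bool ok = (h_out[0] == 2);

    cudaFree(d_a);
    cudaFree(d_b);
    return ok;
}

int main() {
    printf("d_a holds former d_b data: %s\n",
           swapRawDevicePointers() ? "yes" : "no");
    return 0;
}
```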

If I have two CUDA Thrust device_vectors, say d_a and d_b, I can swap them by using a third temporary vector, say d_c, but this requires memory movements.

My question is: is there a way to swap CUDA Thrust device_vectors without memory movements?

2
The thrust::vector class has a member of type contiguous_storage which is used for storing the vector contents. When vectors are swapped, internally, only the begin() iterator, size and allocator of contiguous_storage are swapped. So there is no memory copy of data involved. You can check this in the swap member function inside the file contiguous_storage.inl. - sgarizvi
In the case of the assignment operator, if you look at the code of vector_base::operator=, it uses the assign function, which seems to perform a complete memory copy of the vector contents. - sgarizvi
@sgarizvi Thanks for your comments. Actually, this is the same objection @talonmies pointed out in his comments below. However, the strange thing is that I cannot find memory copies in the timeline. Perhaps thrust uses a kernel to perform the copy? - Vitality
@sgarizvi I have replaced the explicit copies with swap in my main code, with benefits on timing. At least, it seems that swap is faster than copy. - Vitality
I think this is a relatively new feature of the vector classes. In the old days I'm pretty sure swap used copy assignment and that triggered memory copies. - talonmies

2 Answers

5
votes

It seems that device_vector.swap() avoids memory movements.

Indeed, consider the following code:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>

#include <thrust/device_vector.h>

void printDeviceVector(thrust::device_vector<int> &d_a) {

    for (int k = 0; k < d_a.size(); k++) {

        int temp = d_a[k];
        printf("%i\n", temp);

    }

}

int main()
{
    const int N = 10;

    thrust::device_vector<int> d_a(N, 1);
    thrust::device_vector<int> d_b(N, 2);

    // --- Original
    printf("Original device vector d_a\n");
    printDeviceVector(d_a);
    printf("Original device vector d_b\n");
    printDeviceVector(d_b);

    d_b.swap(d_a);

    // --- After swap
    printf("Final device vector d_a\n");
    printDeviceVector(d_a);
    printf("Final device vector d_b\n");
    printDeviceVector(d_b);

    d_a.clear();
    thrust::device_vector<int>().swap(d_a); 
    d_b.clear();
    thrust::device_vector<int>().swap(d_b);

    cudaDeviceReset();

    return 0;
}

When the code is profiled with

    d_b.swap(d_a);

no device-to-device memory movement appears in the timeline:

[Image: profiler timeline for d_b.swap(d_a)]

If, on the other hand, we change d_b.swap(d_a) to

d_b = d_a;

then device-to-device movements appear in the timeline:

[Image: profiler timeline for d_b = d_a]

Finally, d_b.swap(d_a) is significantly faster than d_b = d_a. For N = 33554432, the timings are

d_b.swap(d_a)     0.001152 ms
d_b = d_a         3.181824 ms

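
The answer does not say how these timings were taken. One plausible way to reproduce such a comparison is with CUDA events, as in the sketch below. The helper timeMs is a hypothetical name introduced here, and the measured values will depend on the GPU and the Thrust version, so no expected output is given.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <thrust/device_vector.h>

// Hypothetical helper: runs f() between two CUDA events recorded on the
// default stream and returns the elapsed time in milliseconds.
template <typename F>
float timeMs(F f) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    f();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    const int N = 33554432;  // same size as in the timings above
    thrust::device_vector<int> d_a(N, 1);
    thrust::device_vector<int> d_b(N, 2);

    // Time the swap first, then the copy assignment.
    float swapMs = timeMs([&] { d_b.swap(d_a); });
    float copyMs = timeMs([&] { d_b = d_a; });

    printf("d_b.swap(d_a)  %f ms\n", swapMs);
    printf("d_b = d_a      %f ms\n", copyMs);
    return 0;
}
```

A single run is noisy, especially for an operation as short as the swap; averaging over several repetitions gives a more reliable comparison.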
3
votes

Not that I am aware of.

There is no constructor exposed which takes an existing device_ptr, and the underlying base vector within device_vector is private, so there is no way to dive in and perform pointer exchange yourself. Those would be the only ways I can think of to make this work without triggering the standard copy constructor.


Edit: it appears this answer is wrong. Recent changes (probably around Thrust 1.6) seem to have implemented an internal pointer-exchange swap mechanism, which can be called via device_vector.swap(). This bypasses the usual copy-constructor idiom for swap() and does not trigger memory transfers.
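
One way to check the pointer-exchange claim without a profiler is to compare the raw device pointers before and after the swap. This is a sketch only: swapExchangesPointers is a hypothetical helper name, and the result depends on the installed Thrust version. If swap() really only exchanges the internal storage, each vector should end up holding the other's original pointer.

```cuda
#include <cstdio>
#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>

// Hypothetical helper: returns true if swap() left each vector pointing at
// the other's original allocation, i.e. no new buffers and no element copy.
bool swapExchangesPointers() {
    thrust::device_vector<int> d_a(10, 1);
    thrust::device_vector<int> d_b(10, 2);

    // Record the raw device addresses before the swap.
    const int *pa = thrust::raw_pointer_cast(d_a.data());
    const int *pb = thrust::raw_pointer_cast(d_b.data());

    d_a.swap(d_b);

    // Compare against the addresses held after the swap.
    return thrust::raw_pointer_cast(d_a.data()) == pb &&
           thrust::raw_pointer_cast(d_b.data()) == pa;
}

int main() {
    printf("swap exchanged the underlying pointers: %s\n",
           swapExchangesPointers() ? "yes" : "no");
    return 0;
}
```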