4
votes

I am currently trying to delete part of the cumbersome cudaMallocHost/cudaFreeHost from my code. To do so, I am willing to use only std::vector, but I absolutely need that the underlying memory must be of pinned cuda memory type.

But, I am facing strange behaviour using the thrust::system::cuda::experimental::pinned_allocator<> from the thrust library:

//STL
#include <iostream>
#include <string>
#include <vector>
#include <algorithm>

//CUDA
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/system/cuda/experimental/pinned_allocator.h>

#define SIZE 4
#define INITVAL 2
#define ENDVAL 4

//Compile using nvcc ./main.cu -o test -std=c++11
int main( int argc, char* argv[] )
{
    // init host
    std::vector<float,thrust::system::cuda::experimental::pinned_allocator<float> > hostVec(SIZE);
    std::fill(hostVec.begin(),hostVec.end(),INITVAL);

    //Init device
    thrust::device_vector<float> thrustVec(hostVec.size());

    //Copy
    thrust::copy(hostVec.begin(), hostVec.end(), thrustVec.begin());

    //std::cout << "Dereferencing values of the device, values should be "<< INITVAL << std::endl;
    std::for_each(thrustVec.begin(),thrustVec.end(),[](float in){ std::cout <<"val is "<<in<<std::endl;} );
    std::cout << "------------------------" << std::endl;

    //Do Stuff
    thrust::transform( thrustVec.begin(), thrustVec.end(), thrust::make_constant_iterator(2), thrustVec.begin(), thrust::multiplies<float>() );

    //std::cout << "Dereferencing values of the device, values should now be "<< ENDVAL << std::endl;
    std::for_each(thrustVec.begin(),thrustVec.end(),[](float in){ std::cout <<"val is "<<in<<std::endl;} );
    std::cout << "------------------------" << std::endl;

    //Copy back
    thrust::copy(thrustVec.begin(), thrustVec.end(), hostVec.begin());

    //Synchronize
    //cudaDeviceSynchronize(); //makes the weird behaviour to go away

    //Check result
    //std::cout << "Dereferencing values on the host, values should now be "<< ENDVAL << std::endl;//Also makes the weird behaviour to go away

    std::for_each(hostVec.begin(),hostVec.end(),[](float in){ std::cout <<"val is "<<in<<std::endl;} ); 

    return EXIT_SUCCESS;
}

Which, in my setup, gives:

val is 2
val is 2
val is 2
val is 2
------------------------
val is 4
val is 4
val is 4
val is 4
------------------------
val is 2
val is 4
val is 4
val is 4

Why does the copy from device to host seems to fail ? Nvvp however shows a perfectly fine chronogram with the right values for copy.

By the way, I use NVCC/cuda/thrust from the 7.5 package, and gcc (GCC) 4.8.5 with a titanX card.

Thank you in advance for your help.

1
I can't reproduce this on any platform I have access to. What happens if you add a synchronization call before you try printing out the vector after the device to host copy?talonmies
I can also reproduce the error on a gtx680 (compute capability 3.0). Indeed, adding a cudaDeviceSynchronize make the code to run as intended. I believed that thrust::copy had a synchronous behaviour, but actually there is no information about the synchronous/asynchronous behaviour of thrust in the documentation: thrust.github.io/doc/…Tobbey

1 Answers

6
votes

This was a real bug, and thrust developpers were already aware of it, see https://github.com/thrust/thrust/issues/775

Using the latest 1.8.3 version of thrust from the github repository solved the problem for me.