CUDA Thrust and sort_by_key

Question

I’m looking for a sorting algorithm on CUDA that can sort an array A of elements (double) and return an array of keys B for that array A. I know the sort_by_key function in the Thrust library, but I want my array of elements A to remain unchanged. What can I do?

My code is:

void sortCUDA(double V[], int P[], int N) {

        double *Vcpy = (double*) malloc(N*sizeof(double));
        memcpy(Vcpy,V,N*sizeof(double));

        thrust::sort_by_key(V, V + N, P);
        free(Vcpy);
}

I'm comparing the Thrust algorithm against others that I have running sequentially on the CPU:

N               mergesort       sortCUDA
113             0.000008        0.000010
226             0.000018        0.000016
452             0.000036        0.000020
905             0.000061        0.000034
1810            0.000135        0.000071
3621            0.000297        0.000156
7242            0.000917        0.000338
14484           0.001421        0.000853
28968           0.003069        0.001931
57937           0.006666        0.003939
115874          0.014435        0.008025
231749          0.031059        0.016718
463499          0.067407        0.039848
926999          0.148170        0.118003
1853998         0.329005        0.260837
3707996         0.731768        0.544357
7415992         1.638445        1.073755
14831984        3.668039        2.150179
115035495       39.276560       19.812200
230070990       87.750377       39.762915
460141980       200.940501      74.605219

Thrust performance is not bad, but I think that with OpenMP I could probably get a better CPU time quite easily.

I think this is because of the memcpy.

SOLUTION:

void thrustSort(double V[], int P[], int N)
{
        thrust::device_vector<int> d_P(N);
        thrust::device_vector<double> d_V(V, V + N);
        thrust::sequence(d_P.begin(), d_P.end());

        thrust::sort_by_key(d_V.begin(), d_V.end(), d_P.begin());

        thrust::copy(d_P.begin(),d_P.end(),P);
}

where V holds the double values to sort

Make a copy of A before sorting? Also, if you are a thrust user, you may want to consider joining the thrust google group. — Robert Crovella
Perhaps you should post some code and answer the questions about sizes. I would expect the cost of the sorting operation to be significantly higher than the cost for a vector copy. — Robert Crovella
You'll need to learn more about thrust, perhaps take a look at the quick start guide. Vectors can live on the host or device. If you pass vectors (or pointers to arrays) that are host-based, Thrust will use a host-based algorithm to sort (leaving the GPU idle). If you pass vectors or pointers that are device-based, Thrust will use a device-based algorithm to sort (i.e. on the GPU). Your code that you posted gives me the impression that your pointers are host based. — Robert Crovella
I'm actually impressed that Thrust is faster than your mergesort, even for sizes as small as 226, especially since you are adding in the cost of the vector copy (don't know if you are doing that with your mergesort -- you didn't post that code.) If you use the thrust device sort, there will be a cost to copy the vectors to the device. This will penalize your small-size sorts but probably give a substantial improvement on the large size ones. Also, the development version of thrust should be substantially faster at sorting. — Robert Crovella
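
To illustrate the host versus device dispatch described in the comments, here is a minimal sketch. The helper names sortOnHost and sortOnDevice are hypothetical and not from the thread. It assumes that raw host pointers select Thrust's host backend and that device_vector iterators select the device backend, as the comments explain.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

// Hypothetical helper: raw host pointers make Thrust dispatch to its
// host (CPU) backend. V is sorted in place and P is permuted with it.
void sortOnHost(double *V, int *P, int N)
{
    thrust::sequence(P, P + N);          // P = 0, 1, ..., N-1
    thrust::sort_by_key(V, V + N, P);    // runs on the CPU
}

// Hypothetical helper: device_vector iterators make Thrust dispatch to
// the device backend. V is only read; the sorted copy lives on the GPU.
void sortOnDevice(const double *V, int *P, int N)
{
    thrust::device_vector<double> d_V(V, V + N);   // host -> device copy
    thrust::device_vector<int> d_P(N);
    thrust::sequence(d_P.begin(), d_P.end());      // d_P = 0, 1, ..., N-1
    thrust::sort_by_key(d_V.begin(), d_V.end(), d_P.begin());  // runs on the GPU
    thrust::copy(d_P.begin(), d_P.end(), P);       // device -> host copy
}
```

In the device version the host array V is left untouched, because only its device copy is sorted. The two transfers are the extra cost that the last comment expects to hurt small inputs.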

Unknown Unknown · Accepted Answer · 2012-11-22T22:25:33

You can modify the comparison operator to sort keys instead of values. @Robert Crovella correctly pointed out that a raw device pointer cannot be assigned from the host. The modified algorithm is below:

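#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>

// Comparator on keys: orders index i relative to j by comparing A[i] and A[j].
// A Thrust comparator does not need a base class.
struct cmp
{
  cmp(const double *ptr) : rawA(ptr) { }

  // '>' yields descending order of A; use '<' for ascending order
  __host__ __device__ bool operator()(const int i, const int j) const
  { return rawA[i] > rawA[j]; }

  const double *rawA; // an array in global mem
};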

void sortkeys(double *A, int n) {
  // move data to the gpu
  thrust::device_vector<double> devA(A, A + n);
  double *rawA = thrust::raw_pointer_cast(devA.data());

  thrust::device_vector<int> B(n);
  // initialize keys
  thrust::sequence(B.begin(), B.end());
  thrust::sort(B.begin(), B.end(), cmp(rawA));
  // B now contains the sorted keys
 }
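
Once B holds the sorted keys, the values can be read in sorted order through B without touching the original array. The following is a minimal sketch of that idea under a few assumptions: the names IndexLess and sortedView are hypothetical, the comparator uses '<' (ascending) rather than the '>' used above, and the results are copied back into std::vector containers for convenience.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/gather.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <vector>

// Hypothetical comparator on indices: i comes before j when A[i] < A[j].
struct IndexLess
{
    explicit IndexLess(const double *ptr) : rawA(ptr) {}

    __host__ __device__ bool operator()(int i, int j) const
    { return rawA[i] < rawA[j]; }

    const double *rawA;  // device pointer; only dereferenced on the GPU here
};

// Hypothetical helper: fills `order` with the sorting permutation of A and
// `sorted` with A read through that permutation. A itself is not modified.
void sortedView(const double *A, int n,
                std::vector<int> &order, std::vector<double> &sorted)
{
    thrust::device_vector<double> devA(A, A + n);   // host -> device copy
    thrust::device_vector<int> B(n);
    thrust::sequence(B.begin(), B.end());           // B = 0, 1, ..., n-1
    thrust::sort(B.begin(), B.end(),
                 IndexLess(thrust::raw_pointer_cast(devA.data())));

    // devSorted[k] = devA[B[k]]
    thrust::device_vector<double> devSorted(n);
    thrust::gather(B.begin(), B.end(), devA.begin(), devSorted.begin());

    order.resize(n);
    sorted.resize(n);
    thrust::copy(B.begin(), B.end(), order.begin());                  // device -> host
    thrust::copy(devSorted.begin(), devSorted.end(), sorted.begin()); // device -> host
}
```

The gather step costs one extra device array of n doubles. If only the permutation is needed, it can be skipped.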

And here is an alternative with ArrayFire, though I am not sure which one is more efficient, since the ArrayFire solution uses two additional arrays:

void sortkeys(double *A, int n) {
   af::array devA(n, A, af::afHost);
   af::array vals, indices;
   // sort and populate vals/indices arrays
   af::sort(vals, indices, devA);
   std::cout << devA << "\n" << indices << "\n";
}
