Thrust: sort_by_key with zip_iterator performance

Question

Problem

I am using sort_by_key with the values being passed using a zip_iterator. This sort_by_key is called many times, and after a certain iteration it becomes 10x slower! What is the cause of this drop in performance?

Symptom

I am sorting 3 vectors using sort_by_key, one of them acts as the key vector:

struct Segment
{
  int v[2];
};

thrust::device_vector<int> keyVec;
thrust::device_vector<int> valVec;
thrust::device_vector<Segment> segVec;

// ... code which fills these vectors ...

thrust::sort_by_key( keyVec.begin(), keyVec.end(),
                     make_zip_iterator( make_tuple( valVec.begin(), segVec.begin() ) ) );

The size of the vector is usually about 4 million. In the initial 2 times it is called, the sort_by_key takes 0.04s, in loop 3 it takes 0.1s and then degrades further to 0.3s for the rest of the loops. Thus, we see a 10x degradation in performance.

Extra Information

To ensure that the only factor of degradation was sort_by_key, I replaced the above with manual sorting using a handwritten kernel:

thrust::device_vector<int> indexVec( keyVec.size() );
thrust::sequence( indexVec.begin(), indexVec.end() );

// Sort the keys and indexes
thrust::sort_by_key( keyVec.begin(), keyVec.end(), indexVec.begin() );

thrust::device_vector<int> valVec2( keyVec.size() );
thrust::device_vector<Segment> segVec2( keyVec.size() );

// Use index array and move vectors to destination
moveKernel<<< x, y >>>(
  toRawPtr( indexVec ),
  indexVec.size(),
  toRawPtr( valVec ),
  toRawPtr( segVec ),
  toRawPtr( valVec2 ),
  toRawPtr( segVec2 ) );

// Swap back into original vectors
valVec.swap( valVec2 );
segVec.swap( segVec2 );

This handwritten sort takes 0.03s and this performance is consistent across all iterations, unlike the performance drop seen with sort_by_key and zip_iterator.

Pavan Yalamanchili Pavan Yalamanchili · Accepted Answer · 2011-04-22T19:03:21

For accurate timing across each loop, you need to use cudaThreadSynchronize at the end of each loop. The timings you are getting for the first two loops may not be the actual timing you are looking for.

Thrust: sort_by_key with zip_iterator performance

1 Answers