I have a Thrust device_vector divided into chunks of 100 elements (all contiguous in GPU memory), and I want to remove the last 5 elements of each chunk without having to allocate a new device_vector to copy the result into.

// Layout in memory before (number of elements in each contiguous subblock listed):
// [   95   | 5 ][   95   | 5 ][   95   | 5 ]........

// Layout in memory after cutting out the last 5 of each chunk (number of elements listed)
// [  95  ][  95  ][  95  ].........

thrust::device_vector v;
// call some function on v;

// so elements 95-99, 195-199, 295-299, etc. are removed (assuming 0-based indexing)

How can I correctly implement this? Preferably I would like to avoid allocating a new vector in GPU memory to save the transform into. I understand there are Thrust template functions for dealing with these kinds of operations, but I have trouble stringing them together. Is there something Thrust provides that can do this?

Are the chunks all the same size? - talonmies
Yes, the chunks are all of the same size. - user1522407
Seems difficult to do this operation in place in parallel, but if you're willing to allocate temporary storage, it's just a call to thrust::copy_if. - Jared Hoberock

1 Answer

Without a temporary buffer you have to preserve the order in which the chunks are copied, so the operation cannot be parallelized well enough to fully utilize the GPU hardware.

Here's a version that does this using Thrust with a temporary buffer.

It requires Thrust 1.6.0+, since a lambda-style placeholder expression is used as the functor on the iterators.
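To make the ordering constraint concrete, here is a minimal host-side sketch of the in-place variant in plain C++ (not Thrust). The helper name `compact_chunks_in_place` is hypothetical, and the sketch assumes that `v.size()` is a multiple of `oldChunk` and that `0 < newChunk <= oldChunk`. Chunk k is moved down to offset `k * newChunk`, which overwrites the tail of earlier chunks' original storage, so the chunks have to be processed in increasing order.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Thrust): keeps the first newChunk elements
// of every oldChunk-sized chunk and packs them together, in place.
// Assumes v.size() % oldChunk == 0 and 0 < newChunk <= oldChunk.
template <typename T>
void compact_chunks_in_place(std::vector<T>& v,
                             std::size_t oldChunk,
                             std::size_t newChunk)
{
    const std::size_t numChunks = v.size() / oldChunk;

    // Chunk 0 is already where it belongs. Later chunks must be moved in
    // increasing order: the destination of chunk k lies before its source,
    // so a forward copy is safe, but it may overwrite the old storage of
    // chunk k - 1, which therefore has to be moved first.
    for (std::size_t k = 1; k < numChunks; ++k) {
        std::copy(v.begin() + k * oldChunk,
                  v.begin() + k * oldChunk + newChunk,
                  v.begin() + k * newChunk);
    }

    // Drop the now-unused tail.
    v.resize(numChunks * newChunk);
}
```

The loop-carried dependency between consecutive chunks is what makes a straightforward parallel version of this unsafe.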

#include "thrust/device_vector.h"
#include "thrust/iterator/counting_iterator.h"
#include "thrust/iterator/permutation_iterator.h"
#include "thrust/iterator/transform_iterator.h"
#include "thrust/copy.h"
#include "thrust/functional.h"

using namespace thrust::placeholders;

int main()
{
    const int oldChunk = 100, newChunk = 95;  // chunk length before and after trimming
    const int size = 10000;                   // number of chunks

    thrust::device_vector<float> v(
            thrust::counting_iterator<float>(0),
            thrust::counting_iterator<float>(0) + oldChunk * size);
    thrust::device_vector<float> buf(newChunk * size);

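    // Gather: buf[i] is read from v[i / newChunk * oldChunk + i % newChunk],
    // i.e. element (i % newChunk) of chunk (i / newChunk).
    thrust::copy(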
            thrust::make_permutation_iterator(
                    v.begin(),
                    thrust::make_transform_iterator(
                            thrust::counting_iterator<int>(0),
                            _1 / newChunk * oldChunk + _1 % newChunk)),
            thrust::make_permutation_iterator(
                    v.begin(),
                    thrust::make_transform_iterator(
                            thrust::counting_iterator<int>(0),
                            _1 / newChunk * oldChunk + _1 % newChunk))
                    + buf.size(),
            buf.begin());

    return 0;
}
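The index arithmetic in the placeholder expression can be checked on the host. The following is a small sketch in plain C++; `source_index` is a hypothetical name introduced here for illustration and simply mirrors the expression `_1 / newChunk * oldChunk + _1 % newChunk`.

```cpp
#include <cstddef>

// Hypothetical helper mirroring the placeholder expression above:
// maps an index i in the compacted output to the index in the original vector.
// i / newChunk is the chunk number, i % newChunk the offset inside the chunk.
constexpr std::size_t source_index(std::size_t i,
                                   std::size_t oldChunk,
                                   std::size_t newChunk)
{
    return i / newChunk * oldChunk + i % newChunk;
}

// Compile-time spot checks for oldChunk = 100, newChunk = 95.
static_assert(source_index(94, 100, 95) == 94,  "last kept element of chunk 0");
static_assert(source_index(95, 100, 95) == 100, "first element of chunk 1");
static_assert(source_index(190, 100, 95) == 200, "first element of chunk 2");
```

Output indices 0 to 94 map to themselves, and index 95 skips over the five removed elements 95 to 99 and lands on element 100.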

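I think the above version may not achieve the highest performance due to the use of the mod operator %. For higher performance you may consider the cuBLAS function cublas<t>geam() (cublasSgeam() for float). It can treat v as a column-major matrix with leading dimension oldChunk and copy the first newChunk rows of each of its size columns into buf: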

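// handle is a cublasHandle_t created beforehand with cublasCreate(&handle).
// With alpha = 1 and beta = 0, geam reduces to a strided copy from v into buf.
float alpha = 1;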
float beta = 0;
cublasSgeam(handle, CUBLAS_OP_N, CUBLAS_OP_N,
            newChunk, size,
            &alpha,
            thrust::raw_pointer_cast(&v[0]), oldChunk,
            &beta,
            thrust::raw_pointer_cast(&v[0]), oldChunk,
            thrust::raw_pointer_cast(&buf[0]), newChunk);
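As a host-side reference for what the call above is expected to produce, here is a minimal sketch in plain C++ (not cuBLAS). The function name `strided_copy` is hypothetical; it assumes column-major storage, `m <= lda`, and an input holding at least `(n - 1) * lda + m` elements. With `m = newChunk`, `n = size` and `lda = oldChunk`, each column is one chunk and only its first `m` rows are kept.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference implementation (not a cuBLAS function):
// copies the top m rows of an n-column, column-major matrix with leading
// dimension lda into a densely packed m x n matrix (leading dimension m).
// Assumes m <= lda and a.size() >= (n - 1) * lda + m.
std::vector<float> strided_copy(const std::vector<float>& a,
                                std::size_t m,
                                std::size_t n,
                                std::size_t lda)
{
    std::vector<float> c(m * n);
    for (std::size_t col = 0; col < n; ++col) {
        for (std::size_t row = 0; row < m; ++row) {
            c[col * m + row] = a[col * lda + row];
        }
    }
    return c;
}
```

Comparing the contents of buf (copied back to the host) against this reference is one way to sanity-check the leading-dimension arguments.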