0
votes

In my project I have to copy a lot of numerical data from a CUDA (GPU) device into a std::valarray (or std::vector), that is, from the video card's memory to host memory.

So I need to resize these data structures as fast as possible, but when I call the member function vector::resize, it initializes all elements of the array to the default value in a loop.

// In a super simplified description, resize behaves like this pseudocode:
vector<T>::resize(N){
   // Setup the new size

   // allocate the new array
   this->_internal_vector = new T[N];

   // init to default
   // This loop is slow !!!!
   for ( i = 0; i < N ; ++i){
      this->_internal_vector[i] = T();
   }
}

Clearly I don't need this initialization, because I have to copy data from the GPU and all the old data is overwritten. The initialization takes time, so I lose performance.

To copy the data I need allocated memory, which the resize() method provides.

A very dirty and wrong solution is to use vector::reserve(), but then I lose all the features of the vector, and if I resize afterwards, the data is replaced with the default value.

So, is there a strategy for avoiding this pre-initialization to the default value (in valarray or vector)?

I want a resize method that behaves like this:
vector<T>::resize(N) {
    // Allocate the memory.
    this->_internal_vector = new T[N];

    // Update the size of the vector or valarray

    // !! DO NOT initialize the new values.
}

An example of the performance:

#include <ctime>  // std::clock, CLOCKS_PER_SEC
#include <iostream>
#include <valarray>
#include <vector>

int main() {

  std::vector<double> vec;
  std::valarray<double> vec2;

  double *vec_raw;

  unsigned int N = 100000000;

  std::clock_t start;
  double duration;

  start = std::clock();
  // Dirty solution!
  vec.reserve(N);

  duration = (std::clock() - start) / (double)CLOCKS_PER_SEC;
  std::cout << "duration reserve: " << duration << std::endl;

  start = std::clock();

  vec_raw = new double[N];

  duration = (std::clock() - start) / (double)CLOCKS_PER_SEC;
  std::cout << "duration new: " << duration << std::endl;

  start = std::clock();

  for (unsigned int i = 0; i < N; ++i) {
    vec_raw[i] = 0;
  }

  duration = (std::clock() - start) / (double)CLOCKS_PER_SEC;
  std::cout << "duration raw init: " << duration << std::endl;

  start = std::clock();
  // Dirty solution: undefined behavior, because size() is still 0 after reserve()
  for (unsigned int i = 0; i < vec.capacity(); ++i) {
    vec[i] = 0;
  }

  duration = (std::clock() - start) / (double)CLOCKS_PER_SEC;
  std::cout << "duration vec init dirty: " << duration << std::endl;

  start = std::clock();

  vec2.resize(N);

  duration = (std::clock() - start) / (double)CLOCKS_PER_SEC;
  std::cout << "duration valarray resize: " << duration << std::endl;

  return 0;
}

Output:

duration reserve: 1.1e-05
duration new: 1e-05
duration raw init: 0.222263
duration vec init dirty: 0.214459
duration valarray resize: 0.215735

Note: replacing std::allocator does not work, because the loop is called by resize().

2
Your initialization of the vector vec is wrong! The reserve function only allocates memory, but the actual size is still unchanged. That means you index out of bounds and have undefined behavior. - Some programmer dude
Also, if you want to set all elements of an array (actual or dynamically allocated) or vector to a single value, use std::fill or std::fill_n instead of explicit loops. You could also use std::memset in both cases. - Some programmer dude
@Some programmer dude Yes, it is wrong! But it is fast. - Giggi
@Some programmer dude I need a block of raw allocated memory (like the old-style malloc()) but owned by a std::vector. Copying memory from a video card to a vector with the C++ standard library is impossible. - Giggi
It doesn't matter if it's "fast". Wrong is still wrong, and you're very lucky it seems to work for you. Another compiler, or even a new version of the one you have, might lead to your program crashing unexpectedly (and maybe not even there). - Some programmer dude

2 Answers

3
votes

Let's say you have an array (or some collection) with the data called data and you want to copy it to a vector vec. Then the idiomatic way to do this would be to use std::vector::reserve and then std::vector::push_back. std::vector::reserve will allocate memory for the std::vector but it will not initialize the memory, or set the internal counter etc. std::vector::push_back will insert the data and update the vector's size. Optionally, use std::vector::insert that takes two iterators, to avoid looping and pushing back every element individually.

std::vector<double> vec;
vec.reserve(std::size(data)); // Allocate all data in one call.
vec.insert(std::begin(vec), std::begin(data), std::end(data)); // Insert the data elements.

Alternatively you can use std::vector's ctor overload that takes two iterators:

std::vector<double> vec{std::begin(data), std::end(data)};

This will also allocate all data in a single call, and then add the elements.
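As a rough, host-only sketch of the two approaches above: the helper names copy_with_reserve and copy_with_range_ctor are made up for illustration, and the source is assumed to be an ordinary host array of double. Both variants read the source through normal pointers, so they only apply to memory the CPU can dereference.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: reserve once, then insert the whole range.
std::vector<double> copy_with_reserve(const double* first, std::size_t count) {
    std::vector<double> vec;
    vec.reserve(count);   // single allocation; no elements are constructed
    assert(vec.empty());  // size() is still 0 after reserve()
    // insert() copy-constructs each element directly from the source,
    // so no default value is written first; size() becomes count.
    vec.insert(vec.end(), first, first + count);
    return vec;
}

// Hypothetical helper: same result via the iterator-range constructor.
std::vector<double> copy_with_range_ctor(const double* first, std::size_t count) {
    return std::vector<double>(first, first + count);
}
```

Calling either helper with a pointer to N host values yields a vector of size N holding copies of those values.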

Update

If you know the data size in advance, you could simply use std::array, e.g.:

constexpr const std::size_t N = 10'000;
std::array<double, N> arr;

arr[5432] = 2.5; // Perfectly valid.
// Or e.g. for CUDA.
cudaMemcpy(std::data(arr), gpu_arr, std::size(arr) * sizeof(double), cudaMemcpyDeviceToHost); // The count is in bytes.

All data will be allocated at once, and no value initialization will be performed: the elements are default-initialized, which for fundamental types means nothing is done and the values are indeterminate until you write to them.

std::array has all the advantages of C++ collections, such as std::size, std::begin, std::end, std::data etc.
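To try the std::array idea without a GPU, here is a small host-only sketch that uses std::memcpy as a stand-in for the device-to-host copy. The names kCount and fill_from are assumptions made for this example. Like cudaMemcpy, std::memcpy takes its size in bytes, hence the multiplication by sizeof(double).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t kCount = 4;  // hypothetical compile-time size

// Hypothetical helper: fills a std::array from a host buffer of kCount doubles.
std::array<double, kCount> fill_from(const double* src) {
    assert(src != nullptr);
    // Default-initialized: the doubles are indeterminate and must not be
    // read before they have been written.
    std::array<double, kCount> arr;
    // Overwrite every element in one bulk copy; the count is in bytes.
    std::memcpy(arr.data(), src, arr.size() * sizeof(double));
    return arr;
}
```

Keep in mind that a std::array declared as a local variable lives on the stack, so this only suits sizes that comfortably fit there.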

1
votes

If you are working with plain old data (no pointers or references, just integers and floats), it may be best to just use a plain old array. Combine that with correct use of memcpy(), and you are pretty much guaranteed to get much better performance than any native C++ implementation.

The point is that C++ cannot really handle swaths of data as swaths of data. It has to handle individual objects of unknown type. It does not know whether these objects may be copied by copying their bits, so it must call the appropriate default, copy, or move constructors, (move) assignment operators, and destructor for each individual element. While good C++ compilers are able to remove much of the resulting garbage code, the result generally cannot compete with the carefully hand-optimized implementations of memcpy() that can just copy in chunks of 16 or more bytes, blissfully ignorant of whether these are actually eight shorts, two doubles, or 1.33 instances of struct { float x,y,z; }.
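A minimal sketch of the plain-array approach, assuming the data is double and the source lives in host memory. The helper name copy_to_raw_buffer is invented here, and std::unique_ptr is only used so the buffer is released automatically. new double[n] default-initializes the elements, which for double means no initialization loop runs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <type_traits>

// Hypothetical helper: allocates n uninitialized doubles and fills them with memcpy.
std::unique_ptr<double[]> copy_to_raw_buffer(const double* src, std::size_t n) {
    // memcpy is only valid for trivially copyable element types.
    static_assert(std::is_trivially_copyable<double>::value,
                  "memcpy requires a trivially copyable type");
    assert(src != nullptr);
    // Default-initialized array: the values are indeterminate until written.
    std::unique_ptr<double[]> buf(new double[n]);
    // One bulk copy; the count is in bytes.
    std::memcpy(buf.get(), src, n * sizeof(double));
    return buf;
}
```

The trade-off is that the buffer does not know its own size, so n has to be carried alongside it.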