8
votes

While optimizing performance critical code, I noticed iterating over a std::set was a bit slow.

I then wrote a benchmarker and tested the speeds of iteration over a vector by iterator (auto it : vector), iterating over a set by iterator, and iterating over a vector by index (int i = 0; i < vector.size(); ++i).

The containers are constructed identically, with 1024 random ints. (Of course, each int is unique since we're working with sets). Then, for each run, we loop through the container and sum their ints into a long int. Each run has 1000 iterations doing the sum, and the test was averaged over 1000 runs.

Here are my results:

Testing vector by iterator
✓           
Maximum duration: 0.012418
Minimum duration: 0.007971
Average duration: 0.008354

Testing vector by index
✓           
Maximum duration: 0.002881
Minimum duration: 0.002094
Average duration: 0.002179

Testing set by iterator
✓           
Maximum duration: 0.021862
Minimum duration: 0.014278
Average duration: 0.014971

As we can see, iterating over a set by iterator is 1.79x slower than by vector, and a whopping 6.87x slower than vector by index.

What is going on here? Isn't a set just a structured vector that checks whether each item is unique upon insertion? Why should it be so much slower?

Edit: Thank you for your replies! Good explanations. By request, here is the code of the benchmark.

#include <chrono>
#include <random>
#include <string>
#include <functional>
#include <set>
#include <vector>

void benchmark(const char* name, int runs, int iterations, std::function<void(int)> func) {
    printf("Testing %s\n", name);

    std::chrono::duration<double> min = std::chrono::duration<double>::max();
    std::chrono::duration<double> max = std::chrono::duration<double>::min();
    std::chrono::duration<double> run = std::chrono::duration<double>::zero();
    std::chrono::duration<double> avg = std::chrono::duration<double>::zero();

    std::chrono::high_resolution_clock::time_point t1;
    std::chrono::high_resolution_clock::time_point t2;

    // [removed] progress bar code
    for (int i = 0; i < runs; ++i) {
        t1 = std::chrono::high_resolution_clock::now();

        func(iterations);

        t2 = std::chrono::high_resolution_clock::now();

        run = std::chrono::duration_cast<std::chrono::duration<double>>(t2 - t1);

        // [removed] progress bar code

        if (run < min) min = run;
        if (run > max) max = run;   
        avg += run / 1000.0;
    }
    // [removed] progress bar code

    printf("Maximum duration: %f\n", max.count());
    printf("Minimum duration: %f\n", min.count());
    printf("Average duration: %f\n", avg.count());

    printf("\n");
}

int main(int argc, char const *argv[]) {
    const unsigned int arrSize = 1024;

    std::vector<int> vector; vector.reserve(arrSize);
    std::set<int> set;

    for (int i = 0; i < arrSize; ++i) {
        while (1) {
            int entry = rand() - (RAND_MAX / 2);
            auto ret = set.insert(entry);
            if (ret.second) {
                vector.push_back(entry);
                break;          
            }
        }
    }

    printf("Created vector of size %lu, set of size %lu\n", vector.size(), set.size());

    benchmark("vector by iterator", 1000, 1000, [vector](int runs) -> void {
        for (int i = 0; i < runs; ++i) {
            long int sum = 0;

            for (auto it : vector) {
                sum += it;
            }
        }
    });

    benchmark("vector by index", 1000, 1000, [vector, arrSize](int runs) -> void {
        for (int i = 0; i < runs; ++i) {
            long int sum = 0;

            for (int j = 0; j < arrSize; ++j) {
                sum += vector[j];
            }
        }
    });

    benchmark("set by iterator", 1000, 1000, [set](int runs) -> void {
        for (int i = 0; i < runs; ++i) {
            long int sum = 0;

            for (auto it : set) {
                sum += it;
            }
        }
    });

    return 0;
}

I'm working on posting the results using O2, but I'm trying to get the compiler to avoid optimizing away the sum.

4
The set is typically implemented as some form of balanced tree (such as red-black tree), so iterating over it requires a lot of pointer-chasing.Branko Dimitrijevic
By iterator and by index shouldn't be that different if you compile with optimizations on. The difference with set is understable because of 1) indirection (a pointer has to be read and followed to reach the next element) 2) memory locality, (elements might be far away in memory). Given the size of your test, it is most likely 1). Also, post your tests so people can see if the timing is fair/fine.alfC
"Isn't a set just a structured vector that checks whether each item is unique upon insertion?" No, if it is that what you are thinking is boost::flat_set boost.org/doc/libs/1_70_0/doc/html/boost/container/…alfC
For the record, without using optimization, all of your numbers are pretty useless. Iterating a vector by index or by iterator should be equally performant, but without optimizations turned on the iterator approach may not inline a bunch of stuff that definitely should be inlined. sets are almost certainly slower, but you're going to have a hard time saying how much slower they are in practice when compiling without optimization.ShadowRanger

4 Answers

15
votes

Isn't a set just a structured vector that checks whether each item is unique upon insertion?

No, by far not. These data structures are completely different, and the main distinction here is the memory layout: std::vector puts its element into a contiguous location in memory, while std::set is a node-based container, where every element is separately allocated and resides at distinct places in memory, possibly far away from each other and definitely in a way that pre-fetching data for fast traversal is impossible for the processor. This is quite the opposite for std::vector - as the next element is always just right "next to" the current one in memory, a CPU will load elements into its cache, and when actually processing the elements, it only has to go to the cache to retrieve the values - which is very fast compared to RAM access.

Note that it's a common need to have a sorted, unique collection of data that is laid out contiguously in memory, and C++2a or the version thereafter might actually ship with a flat_set, have a look at P1222.

Matt Austern's "Why you shouldn't use set (and what you should use instead)" is an interesting read, too.

6
votes

The main reason is that when you iterate over a std::vector which stores its element in a contiguous memory chuck you basically do:

++p;

where p is a T* raw pointer. The stl code is:

 __normal_iterator&
 operator++() _GLIBCXX_NOEXCEPT
 {
    ++_M_current;                            // <--- std::vector<>: ++iter
    return *this;
 }

For a std::set, the underlying object is more complex and in most implementations you iterate over a tree like structure. In its simplest form this is something like:

p=p->next_node;

where p is a pointer over a tree node structure:

struct tree_node {
   ...
   tree_node *next_node;
};

but in practice the "real" stl code is much more complex:

_Self&
operator++() _GLIBCXX_NOEXCEPT
{
    _M_node = _Rb_tree_increment(_M_node);   // <--- std::set<> ++iter
    return *this;
}

// ----- underlying code \/\/\/

static _Rb_tree_node_base*
local_Rb_tree_increment(_Rb_tree_node_base* __x) throw ()
{
  if (__x->_M_right != 0) 
    {
      __x = __x->_M_right;
      while (__x->_M_left != 0)
        __x = __x->_M_left;
    }
  else 
    {
      _Rb_tree_node_base* __y = __x->_M_parent;
      while (__x == __y->_M_right) 
        {
          __x = __y;
          __y = __y->_M_parent;
        }
      if (__x->_M_right != __y)
        __x = __y;
    }
  return __x;
}

_Rb_tree_node_base*
_Rb_tree_increment(_Rb_tree_node_base* __x) throw ()
{
  return local_Rb_tree_increment(__x);
}

const _Rb_tree_node_base*
_Rb_tree_increment(const _Rb_tree_node_base* __x) throw ()
{
  return local_Rb_tree_increment(const_cast<_Rb_tree_node_base*>(__x));
}

(see: What is the definition of _Rb_tree_increment in bits/stl_tree.h?)

2
votes

First of all you should note, that a std::set is sorted. This is typically achieved by storing the data in a tree-like structure.

A vector is typically stored in a contiguous memory area (like a simple array) which can therefore be cached. And this is why it is faster.

1
votes

std::vector is a contiguous structure. The elements are all laid out in memory sequentially, so iterating it only requires an addition and a single pointer lookup per element. Additionally it's very cache-friendly since retrieving an element will generally cause a whole chunk of the vector to be loaded into cache.

std::set is a node-based structure; generally a red-black tree. Iterating over it is more involved and requires chasing several pointers per element. It's also not very cache-friendly since the elements aren't necessarily near each other in memory.