4
votes

I have a function that gets all neighbours of a list of points in a grid out to a certain distance, which involves a lot of duplicates (my neighbour's neighbour == me again).

I've been experimenting with a couple of different solutions, but I have no idea which is more efficient. Below is some code demonstrating two solutions running side by side: one using std::vector with sort-unique-erase, the other copying into a std::unordered_set.

I also tried another solution, which is to pass the vector containing the neighbours so far to the neighbour function, which will use std::find to ensure a neighbour doesn't already exist before adding it.
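
For reference, that third approach looks roughly like this (a simplified sketch -- add_neighbour is just a stand-in name, and it assumes VecDi can be compared with == in a way std::find accepts):

// Only append a neighbour if it isn't already in the list --
// std::find makes each insertion O(N) in the current list size.
void add_neighbour(std::vector<VecDi>& aneighs, const VecDi& pos)
{
    if (std::find(aneighs.begin(), aneighs.end(), pos) == aneighs.end())
        aneighs.push_back(pos);
}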

So three solutions, but I can't quite wrap my head around which is gonna be faster. Any ideas anyone?

Code snippet follows:

// Vector of all neighbours of all modified phi points, which may initially include duplicates.
std::vector<VecDi> aneighs;
// Hash function, mapping points to their norm distance.
auto hasher = [&] (const VecDi& a) {
    return std::hash<UINT>()(a.squaredNorm() >> 2);
};
// Unordered set for storing neighbours without duplication.
std::unordered_set<VecDi, decltype(hasher)> sneighs(phi.dims().squaredNorm() >> 2, hasher);

... compute big long list of points including many duplicates ...

// Insert neighbours into unordered_set to remove duplicates.
std::copy(aneighs.begin(), aneighs.end(), std::inserter(sneighs, sneighs.end()));

// De-dupe neighbours list.
// TODO: is this method faster or slower than unordered_set?
std::sort(aneighs.begin(), aneighs.end(), [&] (const VecDi& a, const VecDi& b) {
    const UINT aidx = Grid<VecDi, D>::index(a, phi.dims(), phi.offset());
    const UINT bidx = Grid<VecDi, D>::index(b, phi.dims(), phi.offset());
    return aidx < bidx;
});
aneighs.erase(std::unique(aneighs.begin(), aneighs.end()), aneighs.end());

2 Answers

5
votes

A great deal here is likely to depend on the size of the output set (which, in turn, will depend on how far out you sample neighbors).

If it's small (no more than a few dozen items or so), your hand-rolled set implementation using std::vector and std::find will probably remain fairly competitive. Its problem is that it's an O(N²) algorithm -- each time you insert an item, you have to search all the existing items, so each insertion is linear in the number of items already in the set. Therefore, as the set grows larger, the total time spent inserting grows roughly quadratically.

With std::set, each insertion only has to do approximately log2(N) comparisons instead of N. That reduces the overall complexity from O(N²) to O(N log N). The major shortcoming is that it's (at least normally) implemented as a tree built up of individually allocated nodes. That typically reduces its locality of reference -- i.e., each item you insert will consist of the data itself plus some pointers, and traversing the tree means following pointers around. Since the nodes are allocated individually, chances are pretty good that nodes that are (currently) adjacent in the tree won't be adjacent in memory, so you'll see a fair number of cache misses. Bottom line: while its speed grows fairly slowly as the number of items increases, the constants involved are fairly large -- for a small number of items, it'll start out fairly slow (typically quite a bit slower than your hand-rolled version).
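
For comparison, the std::set version would look something like this (a sketch only; it assumes your point type exposes data()/size() like an Eigen vector so a lexicographic ordering can be built -- substitute whatever strict weak ordering actually fits your type):

// Sketch: de-duplicate on the fly with std::set (needs <set> and <algorithm>).
// The comparator imposes a lexicographic order on the points; any strict weak
// ordering will do.
auto cmp = [](const VecDi& a, const VecDi& b) {
    return std::lexicographical_compare(a.data(), a.data() + a.size(),
                                        b.data(), b.data() + b.size());
};
std::set<VecDi, decltype(cmp)> sneighs(cmp);
for (const VecDi& p : aneighs)
    sneighs.insert(p);   // ~log2(M) comparisons per insert, M = unique items so far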

Using vector/sort/unique combines some of the advantages of each of the preceding. Storing the items in a vector (without extra pointers for each) typically leads to better cache usage -- items at adjacent indexes are also at adjacent memory locations, so when you insert a new item, chances are that the location for the new item will already be in the cache. The major disadvantage is that if you're dealing with a really large set, this can use quite a bit more memory: where a set eliminates duplicates as you insert each item (i.e., an item is only inserted if it's different from everything already in the set), this inserts all the items and only deletes the duplicates at the end. Given current memory availability and the number of neighbors I'd guess you're probably visiting, I doubt this is a major disadvantage in practice, but under the wrong circumstances it could lead to a serious problem -- nearly any use of virtual memory would almost certainly make it a net loss.

Looking at the last from a complexity viewpoint, it's going to be O(N log N), sort of like the set. The difference is that with the set it's really more like O(N log M), where N is the total number of neighbors and M is the number of unique neighbors. With the vector, it's really O(N log N), where N is (again) the total number of neighbors. As such, if the number of duplicates is extremely large, a set could have a significant algorithmic advantage.

It's also possible to implement a set-like structure using only linear (contiguous) sequences. This retains the set's advantage of only storing unique items, but also the vector's locality-of-reference advantage. The idea is to keep most of the current collection sorted, so you can binary-search it in log(N) time. New items go into a separate vector (or an unsorted tail of the existing vector), and each lookup also does a linear search over that small unsorted portion.

When that unsorted part gets too large (for some definition of "too large"), you sort those items and merge them into the main group, then start the same sequence again. If you define "too large" in terms of log N (where N is the number of items in the sorted group), you can retain O(N log N) complexity for the data structure as a whole. When I've played with it, though, I've found that the unsorted portion can be larger than I'd have expected before it starts to cause a problem.
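
A minimal sketch of that idea (untested, with int elements and a fixed threshold purely for illustration; per the above you'd really want the threshold tied to log N):

#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Set-like structure kept in flat vectors: a sorted main part plus a small
// unsorted "pending" tail.  Lookups binary-search the sorted part and linearly
// scan the tail; when the tail grows past a threshold, it's sorted and merged
// back into the main part.
class flat_set_sketch {
public:
    bool contains(int x) const {
        return std::binary_search(sorted_.begin(), sorted_.end(), x)
            || std::find(pending_.begin(), pending_.end(), x) != pending_.end();
    }

    void insert(int x) {
        if (contains(x))
            return;
        pending_.push_back(x);
        if (pending_.size() > limit_)
            merge_pending();
    }

    // Final unique, sorted contents (merges any leftover pending items first).
    const std::vector<int>& items() {
        if (!pending_.empty())
            merge_pending();
        return sorted_;
    }

private:
    void merge_pending() {
        std::sort(pending_.begin(), pending_.end());
        std::vector<int> merged;
        merged.reserve(sorted_.size() + pending_.size());
        std::merge(sorted_.begin(), sorted_.end(),
                   pending_.begin(), pending_.end(),
                   std::back_inserter(merged));
        sorted_.swap(merged);
        pending_.clear();
    }

    std::vector<int> sorted_;    // unique, sorted items
    std::vector<int> pending_;   // recent insertions, unsorted
    // Fixed threshold for brevity; tie this to log(sorted_.size()) to keep the
    // O(N log N) bound described above.
    std::size_t limit_ = 64;
};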

1
votes

The unordered set has constant-time, O(1), insertion on average, so de-duplicating is O(n), where n is the number of elements before duplicate removal.

Sorting a list of n elements is O(n log n), and going over the list to remove duplicates is O(n). O(n log n) + O(n) = O(n log n).

The unordered set (which performs like a hash table) is better.
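
For example, a minimal sketch of the hash-based de-duplication, using a plain struct as the point type and a simple hand-rolled hash (both just for illustration):

#include <cstddef>
#include <functional>
#include <unordered_set>
#include <vector>

struct Point { int x, y; };

bool operator==(const Point& a, const Point& b) { return a.x == b.x && a.y == b.y; }

struct PointHash {
    std::size_t operator()(const Point& p) const {
        // Simple hash combine; good enough for a sketch.
        std::size_t h = std::hash<int>()(p.x);
        return h ^ (std::hash<int>()(p.y) + 0x9e3779b9 + (h << 6) + (h >> 2));
    }
};

// O(n) on average: one hash insertion per input point.
// Note the output order is unspecified.
std::vector<Point> dedupe(const std::vector<Point>& in) {
    std::unordered_set<Point, PointHash> seen(in.begin(), in.end());
    return std::vector<Point>(seen.begin(), seen.end());
}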

Data on unordered_set complexity: http://en.cppreference.com/w/cpp/container/unordered_set