Insert a sorted range into std::set with hint

Question

Assume I have a std::set (which is by definition sorted), and I have another range of sorted elements (for the sake of simplicity, in a different std::set object). Also, I have a guarantee that all values in the second set are larger than all the values in the first set.

I know I can efficiently insert one element into std::set - if I pass a correct hint, this will be O(1). I know I can insert any range into std::set, but as no hint is passed, this will be O(k logN) (where k is number of new elements, and N number of old elements).

Can I insert a range in a std::set and provide a hint? The only way I can think of is to do k single inserts with a hint, which does push the complexity of the insert operations in my case down to O(k):

std::set <int> bigSet{1,2,5,7,10,15,18};
std::set <int> biggerSet{50,60,70};  

for(auto bigElem : biggerSet)
    bigSet.insert(bigSet.end(), bigElem);

Why would k hinted insertions of complexity O(1) result in an O(kN) complexity operation? Isn't it O(k) -pun not intended? — papagaga
@papagaga Oh, well, that's because k times 1 equals kN for sufficiently large values of 1 :) Sorry, typo, I edited it out. — penelope
I wonder if std::set is what you're looking for. Using it to keep sorted collections isn't a good idea. You use std::set when you need to rely on the property that insertions don't invalidate iterators and references. To keep sorted collections, std::vector and <algorithm> generally are the best choice. And what you ask for is a no brainer on two vectors — papagaga
@papagaga I am actually working with std::maps, which represent, in a way, very sparse histograms. I calculate several of these histograms (about 10, not thousands). The calculation is the complicated bit, where I need to add and update many, many elements out of order, so I need find/insert O(logN) and the std::map is the correct data structure for this. After each calculation, I need to update the "globalHistogram" by adding the information of the current histogram to it which is what my question refers to. I asked the question with std::set for simplicity instead. — penelope
@papagaga Currently my construction is O(NlogN). Working with a vector, I would need to at least std::find before every insertion to check for duplicates (80%+ of insertions are duplicates; so keeping everything and then filtering unique is horrible memory-wise), and sort it in the end - O(N^2 + NlogN). I would also need to std::find every time I needed to read a value from the sparse histogram, which is again not very ideal. — penelope

Jerry Coffin Jerry Coffin · Accepted Answer · 2018-04-13T04:06:18

First of all, to do the merge you're talking about, you probably want to use set (or map's) merge member function, which will let you merge some existing map into this one. The advantage of doing this (and the reason you might not want to, depending your usage pattern) is that the items being merged in are actually moved from one set to the other, so you don't have to allocate new nodes (which can save a fair amount of time). The disadvantage is that the nodes then disappear from the source set, so if you need each local histogram to remain intact after being merged into the global histogram, you don't want to do this.

You can typically do better than O(log N) when searching a sorted vector. Assuming reasonably predictable distribution you can use an interpolating search to do a search in (typically) around O(log log N), often called "pseudo-constant" complexity.

Given that you only do insertions relatively infrequently, you might also consider a hybrid structure. This starts with a small chunk of data that you don't keep sorted. When you reach an upper bound on its size, you sort it and insert it into a sorted vector. Then you go back to adding items to your unsorted area. When it reaches the limit, again sort it and merge it with the existing sorted data.

Assuming you limit the unsorted chunk to no larger than log(N), search complexity is still O(log N)--one log(n) binary search or log log N interpolating search on the sorted chunk, and one log(n) linear search on the unsorted chunk. Once you've verified that an item doesn't exist yet, adding it has constant complexity (just tack it onto the end of the unsorted chunk). The big advantage is that this can still easily use a contiguous structure such as a vector, so it's much more cache friendly than a typical tree structure.

Since your global histogram is (apparently) only ever populated with data coming from the local histograms, it might be worth considering just keeping it in a vector, and when you need to merge in the data from one of the local chunks, just use std::merge to take the existing global histogram and the local histogram, and merge them together into a new global histogram. This has O(N + M) complexity (N = size of global histogram, M = size of local histogram). Depending on the typical size of a local histogram, this could pretty easily work out as a win.

Insert a sorted range into std::set with hint

3 Answers