35
votes

I have a vector containing few non-adjacent duplicates.

As a simple example, consider:

2 1 6 1 4 6 2 1 1

I am trying to make this vector unique by removing the non-adjacent duplicates and maintaining the order of elements.

Result would be:

2 1 6 4 

The solutions I tried are:

  1. Inserting into a std::set but the problem with this approach is that it will disturb the order of elements.
  2. Use the combination of std::sort and std::unique. But again same order problem.
  3. Manual duplicate elimination:

        Define a temporary vector TempVector.
        for (each element in a vector)
        {
            if (the element does not exists in TempVector)
            {
                add to TempVector;
            }
        }
        swap orginial vector with TempVector.
    

My question is:

Is there any STL algorithm which can remove the non-adjacent duplicates from the vector ? what is its complexity?

11
Since constructing a set from the elements would cost O(n log n), it cannot be possible to construct one and also retain the order seen. So, pick an O(n log n) and go with that.Phil H

11 Answers

13
votes

I think you would do it like this:

I would use two iterators on the vector :

The first of one reads the data and inserts it a temporary set.

When the read data was not in the set you copy it from the first iterator to the second and increment it.

At the end you keep only the data up to the second iterator.

The complexity is O( n .log( n ) ) as the lookup for duplicated elements uses the set, not the vector.

#include <vector>
#include <set>
#include <iostream>

int main(int argc, char* argv[])
{
    std::vector< int > k ;

    k.push_back( 2 );
    k.push_back( 1 );
    k.push_back( 6 );
    k.push_back( 1 );
    k.push_back( 4 );
    k.push_back( 6 );
    k.push_back( 2 );
    k.push_back( 1 );
    k.push_back( 1 );

{
    std::vector< int >::iterator r , w ;

    std::set< int > tmpset ;

    for( r = k.begin() , w = k.begin() ; r != k.end() ; ++r )
    {
        if( tmpset.insert( *r ).second )
        {
            *w++ = *r ;
        }
    }

    k.erase( w , k.end() );
}


    {
        std::vector< int >::iterator r ;

        for( r = k.begin() ; r != k.end() ; ++r )
        {
            std::cout << *r << std::endl ;
        }
    }
}
13
votes

Without using a temporary set it's possible to do this with (possibly) some loss of performance:

template<class Iterator>
Iterator Unique(Iterator first, Iterator last)
{
    while (first != last)
    {
        Iterator next(first);
        last = std::remove(++next, last, *first);
        first = next;
    }

    return last;
}

used as in:

vec.erase( Unique( vec.begin(), vec.end() ), vec.end() );

For smaller data sets, the implementation simplicity and lack of extra allocation required may offset the theoretical higher complexity of using an additional set. Measurement with a representative input is the only way to be sure, though.

7
votes

As the question was "is there any STL algorithm...? what is its complexity?" it makes sense to implement the function like std::unique:

template <class FwdIterator>
inline FwdIterator stable_unique(FwdIterator first, FwdIterator last)
{
    FwdIterator result = first;
    std::unordered_set<typename FwdIterator::value_type> seen;

    for (; first != last; ++first)
        if (seen.insert(*first).second)
            *result++ = *first;
    return result;
}

So this is how std::unique is implemented plus an extra set. The unordered_set shall be faster than a regular set. All elements are removed that compare equal to the element right preceding them (the first element is kept because we cannot unify to nothing). The iterator returned points to the new end within the range [first,last).

EDIT: The last sentence means that the container itself is NOT modified by unique. This can be confusing. The following example actually reduces the container to the unified set.

1: std::vector<int> v(3, 5);
2: v.resize(std::distance(v.begin(), unique(v.begin(), v.end())));
3: assert(v.size() == 1);

Line 1 creates a vector { 5, 5, 5 }. In line 2 calling unique returns an iterator to the 2nd element, which is the first element that is not unique. Hence distance returns 1 and resize prunes the vector.

6
votes

You can remove some of the loops in fa's answer using remove_copy_if:

class NotSeen : public std::unary_function <int, bool>
{
public:
  NotSeen (std::set<int> & seen) : m_seen (seen) { }

  bool operator ()(int i) const  {
    return (m_seen.insert (i).second);
  }

private:
  std::set<int> & m_seen;
};

void removeDups (std::vector<int> const & iv, std::vector<int> & ov) {
  std::set<int> seen;
  std::remove_copy_if (iv.begin ()
      , iv.end ()
      , std::back_inserter (ov)
      , NotSeen (seen));
}

This has no affect on the complexity of the algorithm (ie. as written it's also O(n log n)). You can improve upon this using unordered_set, or if the range of your values is small enough you could simply use an array or a bitarray.

3
votes

There's no STL algorithm doing what you want preserving the sequence's original order.

You could create a std::set of iterators or indexes into the vector, with a comparison predicate that uses the referenced data rather than the iterators/indexes to sort stuff. Then you delete everything from the vector that isn't referenced in the set. (Of course, you could just as well use another std::vector of iterators/indexes, std::sort and std::unique that, and use this as a reference as to what to keep.)

3
votes

Based on the answer of @fa. It can also get rewritten using the STL algorithm std::stable_partition:

struct dupChecker_ {
    inline dupChecker_() : tmpSet() {}
    inline bool operator()(int i) {
        return tmpSet.insert(i).second;
    }
private:
    std::set<int> tmpSet;
};

k.erase(std::stable_partition(k.begin(), k.end(), dupChecker_()), k.end());

This way it is more compact and we don't need to care of the iterators.

It seems even not to introduce to much performance penalty. I use it in my project which needs to handle quite large vectors of complex types often and it makes no real difference.

Another nice feature is, that it is possible to adjust the uniqueness by using std::set<int, myCmp_> tmpSet;. For instance, in my project to ignore certain rounding errors.

2
votes

There is a nice article by John Torjo which deals with this very question in a systematic way. The result he comes up with seems more generic and more efficient than any of the solutions suggested here so far:

http://www.builderau.com.au/program/java/soa/C-Removing-duplicates-from-a-range/0,339024620,320271583,00.htm

https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-1052159.html

Unfortunately, the complete code of John's solution seems to be no longer available, and John did not respond to may email. Therefore, I wrote my own code which is based on similar grounds like his, but intentionally differs in some details. Feel free to contact me (vschoech think-cell com) and discuss the details if you wish.

To make the code compile for you, I added some of my own library stuff which I use regularly. Also, instead of going with plain stl, I use boost a lot to create more generic, more efficient, and more readable code.

Have fun!

#include <vector>
#include <functional>

#include <boost/bind.hpp>
#include <boost/range.hpp>
#include <boost/iterator/counting_iterator.hpp>

/////////////////////////////////////////////////////////////////////////////////////////////
// library stuff

template< class Rng, class Func >
Func for_each( Rng& rng, Func f ) {
    return std::for_each( boost::begin(rng), boost::end(rng), f );
};

template< class Rng, class Pred >
Rng& sort( Rng& rng, Pred pred ) {
    std::sort( boost::begin( rng ), boost::end( rng ), pred );
    return rng; // to allow function chaining, similar to operator+= et al.
}

template< class T >
boost::iterator_range< boost::counting_iterator<T> > make_counting_range( T const& tBegin, T const& tEnd ) {
    return boost::iterator_range< boost::counting_iterator<T> >( tBegin, tEnd );
}

template< class Func >
class compare_less_impl {
private:
    Func m_func;
public:
    typedef bool result_type;
    compare_less_impl( Func func ) 
    :   m_func( func )
    {}
    template< class T1, class T2 > bool operator()( T1 const& tLeft, T2 const& tRight ) const {
        return m_func( tLeft ) < m_func( tRight );
    }
};

template< class Func >
compare_less_impl<Func> compare_less( Func func ) {
    return compare_less_impl<Func>( func );
}


/////////////////////////////////////////////////////////////////////////////////////////////
// stable_unique

template<class forward_iterator, class predicate_type>
forward_iterator stable_unique(forward_iterator itBegin, forward_iterator itEnd, predicate_type predLess) {
    typedef std::iterator_traits<forward_iterator>::difference_type index_type;
    struct SIteratorIndex {
        SIteratorIndex(forward_iterator itValue, index_type idx) : m_itValue(itValue), m_idx(idx) {}
        std::iterator_traits<forward_iterator>::reference Value() const {return *m_itValue;}
        index_type m_idx;
    private:
        forward_iterator m_itValue;
    };

    // {1} create array of values (represented by iterators) and indices
    std::vector<SIteratorIndex> vecitidx;
    vecitidx.reserve( std::distance(itBegin, itEnd) );
    struct FPushBackIteratorIndex {
        FPushBackIteratorIndex(std::vector<SIteratorIndex>& vecitidx) : m_vecitidx(vecitidx) {}
        void operator()(forward_iterator itValue) const {
            m_vecitidx.push_back( SIteratorIndex(itValue, m_vecitidx.size()) );
        }
    private:
        std::vector<SIteratorIndex>& m_vecitidx;
    };
    for_each( make_counting_range(itBegin, itEnd), FPushBackIteratorIndex(vecitidx) );

    // {2} sort by underlying value
    struct FStableCompareByValue {
        FStableCompareByValue(predicate_type predLess) : m_predLess(predLess) {}
        bool operator()(SIteratorIndex const& itidxA, SIteratorIndex const& itidxB) {
            return m_predLess(itidxA.Value(), itidxB.Value())
                // stable sort order, index is secondary criterion
                || !m_predLess(itidxB.Value(), itidxA.Value()) && itidxA.m_idx < itidxB.m_idx;
        }
    private:
        predicate_type m_predLess;
    };
    sort( vecitidx, FStableCompareByValue(predLess) );

    // {3} apply std::unique to the sorted vector, removing duplicate values
    vecitidx.erase(
        std::unique( vecitidx.begin(), vecitidx.end(),
            !boost::bind( predLess,
                // redundand boost::mem_fn required to compile
                boost::bind(boost::mem_fn(&SIteratorIndex::Value), _1),
                boost::bind(boost::mem_fn(&SIteratorIndex::Value), _2)
            )
        ),
        vecitidx.end()
    );

    // {4} re-sort by index to match original order
    sort( vecitidx, compare_less(boost::mem_fn(&SIteratorIndex::m_idx)) );

    // {5} keep only those values in the original range that were not removed by std::unique
    std::vector<SIteratorIndex>::iterator ititidx = vecitidx.begin();
    forward_iterator itSrc = itBegin;
    index_type idx = 0;
    for(;;) {
        if( ititidx==vecitidx.end() ) {
            // {6} return end of unique range
            return itSrc;
        }
        if( idx!=ititidx->m_idx ) {
            // original range must be modified
            break;
        }
        ++ititidx;
        ++idx;
        ++itSrc;
    }
    forward_iterator itDst = itSrc;
    do {
        ++idx;
        ++itSrc;
        // while there are still items in vecitidx, there must also be corresponding items in the original range
        if( idx==ititidx->m_idx ) {
            std::swap( *itDst, *itSrc ); // C++0x move
            ++ititidx;
            ++itDst;
        }
    } while( ititidx!=vecitidx.end() );

    // {6} return end of unique range
    return itDst;
}

template<class forward_iterator>
forward_iterator stable_unique(forward_iterator itBegin, forward_iterator itEnd) {
    return stable_unique( itBegin, itEnd, std::less< std::iterator_traits<forward_iterator>::value_type >() );
}

void stable_unique_test() {
    std::vector<int> vecn;
    vecn.push_back(1);
    vecn.push_back(17);
    vecn.push_back(-100);
    vecn.push_back(17);
    vecn.push_back(1);
    vecn.push_back(17);
    vecn.push_back(53);
    vecn.erase( stable_unique(vecn.begin(), vecn.end()), vecn.end() );
    // result: 1, 17, -100, 53
}
2
votes

My question is:

Is there any STL algorithm which can remove the non-adjacent duplicates from the vector ? what is its complexity?

The STL options are the ones you mentioned: put the items in a std::set, or call std::sort, std::unique and calling erase() on the container. Neither of these fulfills your requirement of "removing the non-adjacent duplicates and maintaining the order of elements."

So why doesn't the STL offer some other option? No standard library will offer everything for every user's needs. The STL's design considerations include "be fast enough for nearly all users," "be useful for nearly all users," and "provide exception safety as much as possible" (and "be small enough for the Standards Committee" as the library Stepanov originally wrote was much larger, and Stroustrup axed out something like 2/3 of it).

The simplest solution I can think of would look like this:

// Note:  an STL-like method would be templatized and use iterators instead of
// hardcoding std::vector<int>
std::vector<int> stable_unique(const std::vector<int>& input)
{
    std::vector<int> result;
    result.reserve(input.size());
    for (std::vector<int>::iterator itor = input.begin();
                                    itor != input.end();
                                    ++itor)
        if (std::find(result.begin(), result.end(), *itor) == result.end())
            result.push_back(*itor);
        return result;
}

This solution should be O(M * N) where M is the number of unique elements and N is the total number of elements.

1
votes

As far as i know there is none in stl. Look up reference.

1
votes

Based on the @Corden's answer, but uses lambda expression and removes duplicates instead of writing them in the output vector

    set<int> s;
    vector<int> nodups;
    remove_copy_if(v.begin(), v.end(), back_inserter(nodups), 
        [&s](int x){ 
            return !s.insert(x).second; //-> .second false means already exists
        } ); 
0
votes

Given your your input is in vector<int> foo you can use remove to do the leg work for you here, then if you want to shrink the vector just use erase otherwise just use last as your one-past-the-end iterator when you want the vector with duplicates removed but order preserved:

auto last = end(foo);

for(auto first = begin(foo); first < last; ++first) last = remove(next(first), last, *first);

foo.erase(last, end(foo));

Live Example

As far as time complexity this will be O(nm). Where n is the number of elements and m is the number of unique elements. As far as space complexity this will use O(n) because it does the removal in place.