14
votes

I have an array containing 100,000 sets. Each set contains natural numbers below 1,000,000. I have to find the number of ordered pairs {m, n}, where 0 < m < 1,000,000, 0 < n < 1,000,000 and m != n, which do not exist together in any of 100,000 sets. A naive method of searching through all the sets leads to 10^5 * (10^6 choose 2) number of searches.

For example I have 2 sets set1 = {1,2,4} set2 = {1,3}. All possible ordered pairs of numbers below 5 are {1,2}, {1,3}, {1,4}, {2,3}, {2,4} and {3,4}. The ordered pairs of numbers below 5 which do not exist together in set 1 are {1,3},{2,3} and {3,4}. The ordered pairs below 5 missing in set 2 are {1,2},{1,4},{2,3},{2,4} and {3,4}. The ordered pairs which do not exist together in both the sets are {2,3} and {3,4}. So the count of number of ordered pairs missing is 2.

Can anybody point me to a clever way of organizing my data structure so that finding the number of missing pairs is faster? I apologize in advance if this question has been asked before.

Update: Here is some information about the structure of my data set. The number of elements in each set varies from 2 to 500,000. The median number of elements is around 10,000. The distribution peaks around 10,000 and tapers down in both direction. The union of the elements in the 100,000 sets is close to 1,000,000.

6
Can we see your current way in a minimal reproducible example? A sample input with the desired output would be helpful to understand your situation as well.Khalil Khalaf
Is there a limit to how many elements are in each set? Where did you get this problem from? Is it from a programming competition? Do you have a reason to believe this can be solved efficiently?hugomg
Can you please elaborate about your expected output? Especially about {m, n}. I'm under an impression that you're looking for two numbers that don't exist in any sets.Leben Asa
Check your example. 2 and 3 are present yet you say (2,3) is missing??David Thomas
@DavidThomas: As I undrstand, OP look for pair which cannot be construct in any set, from first set {1,2,4}, we cannot do {1,3}, {2,3}, {3,4}. Second set only allow {1, 3}, so remaining pairs are {2,3}, {3,4}Jarod42

6 Answers

3
votes

If you are looking for combinations across sets, there is a way to meaningfully condense your dataset, as shown in frenzykryger's answer. However, from your examples, what you're looking for is the number of combinations available within each set, meaning each set contains irreducible information. Additionally, you can't use combinatorics to simply obtain the number of combinations from each set either; you ultimately want to deduplicate combinations across all sets, so the actual combinations matter.

Knowing all this, it is difficult to think of any major breakthroughs you could make. Lets say you have i sets and a maximum of k items in each set. The naive approach would be:

  • If your sets are typically dense (i.e. contain most of the numbers between 1 and 1,000,000), replace them with the complement of the set instead
  • Create a set of 2 tuples (use a set structure that ensures insertion is idempotent)
  • For each set O(i):
    • Evaluate all combinations and insert into set of combinations: O(k choose 2)

The worst case complexity for this isn't great, but assuming you have scenarios where a set either contains most of the numbers between 0 and 1,000,000, or almost none of them, you should see a big improvement in performance.

Another approach would be to go ahead and use combinatorics to count the number of combinations from each set, then use some efficient approach to find the number of duplicate combinations among sets. I'm not aware of such an approach, but it is possible it exists.

3
votes

First lets solve more simple task of counting number of elements not present in your sets. This task can be reworded in more simple form - instead of 100,000 sets you can think about 1 set which contains all your numbers. Then number of elements not present in this set is x = 1000000 - len(set). Now you can use this number x to count number of combinations. With repetitions: x * x, without repetitions: x * (x - 1). So bottom line of my answer is to put all your numbers in one big set and use it's length to find number of combinations using combinatorics.

Update

So above we have a way to find number of combinations where each element in combination is not in any of the sets. But question was to find number of combinations where each combination is not present in any of the sets.

Lets try to solve simpler problem first:

  • your sets have all numbers in them, none missing
  • each number is present exactly in one set, no duplicates across sets

How you would construct such combinations over such sets? You would simply pick two elements from different sets and resulting combination would not be in any of the sets. Number of such combinations could be counted using following code (it accepts sizes of the sets):

int count_combinations(vector<int>& buckets) {
  int result = 0;
  for (int i=0; i < buckets.size(); ++i) {
    for (int j=i+1; j < buckets.size(); ++j) {
      result += buckets[i] * buckets[j];
    }
  }
  return result;
}

Now let's imagine that some numbers are missing. Then we can just add additional set with those missing numbers to our sets (as a separate set). But we also need to account that given there were n missing numbers there would be n * (n-1) combinations constructed using only these missing numbers. So following code will produce total number of combinations with account to missing numbers:

int missing_numbers = upper_bound - all_numbers.size() - 1;
int missing_combinations = missing_numbers * (missing_numbers - 1);

return missing_combinations + count_combinations(sets, missing_numbers); 

Now lets imagine we have a duplicate across two sets: {a, b, c}, {a, d}. What types of errors they will introduce? Following pairs: {a, a} - repetition, {a, d} - combination which is present in second set. So how to treat such duplicates? We need to eliminate them completely from all sets. Even single instance of a duplicate will produce combination present in some set. Because we can just pick any element from the set where duplicate was removed and produce such combination (in my example - if we will keep a in first set, then pick d from the second to produce {a, d}, if we will keep a in second set, then pick b or c from the first to produce {a, b} and {a, c}). So duplicates shall be removed.

Update

However we can't simply remove all duplicates, consider this counterexample: {a, b} {a, c} {d}. If we simply remove a we will acquire {b} {c} {d} and lost information about not-existing combination {a, d}. Consider another counterexample: {a, b} {a, b, c} {b, d}. If we simply remove duplicates we will acquire {c} {d} and lost information about {a, d}.

Also we can't simply apply such logic to pairs of sets, a simple counter example for numbers < 3: {1, 2} {1} {2}. Here number of missing combinations is 0, but we will incorrectly count in {1, 2} if we will apply duplicates removal to pair of sets. Bottom line is that I can't come up with good technique which will help to correctly handle duplicate elements across sets.

2
votes

What you can do, depending on memory requirements, is take advantage of the ordering of Set, and iterate over the values smartly. Something like the code below (untested). You'll iterate over all of your sets, and then for each of your sets you'll iterate over their values. For each of these values, you'll check all of the values in the set after them. Our complexity is reduced to the number of sets times the square of their sizes. You can use a variety of methods to keep track of your found/unfound count, but using a set should be fine, since insertion is simply O(log(n)) where n is no more than 499999500000. In theory using a map of sets (mapping based on the first value) could be slightly faster, but in either case the cost is minimal.

long long numMissing(const std::array<std::set<int>, 100000>& sets){
    std::set<pair<int, int> > found;
    for (const auto& s : sets){
        for (const auto& m : s){
            const auto &n = m;
            for (n++; n != s.cend(); n++){
                found.emplace(m, n);
            }
        }
    }
    return 499999500000 - found.size();
}
1
votes

As an option you can build Bloom Filter(s) over your sets.

Before checking against all sets you can quickly lookup at your bloom filter and since it will never produce false negatives you can safely use your pair as its not present in your sets.

1
votes

Physically storing each possible pair would take too much memory. We have 100k sets and an average set has 10k numbers = 50M pairs = 400MB with int32 (and set<pair<int, int>> needs much more than 8 bytes per element).

My suggestion is based on two ideas:

  • don't store, only count the missing pairs
  • use interval set for compact storage and fast set operations (like boost interval set)

The algorithm is still quadratic on the number of elements in the sets but needs much less space.

Algorithm:

  • Create the union_set of the individual sets.
  • We also need a data structure, let's call it sets_for_number to answer this question: which sets contain a particular number? For the simplest case this could be unordered_map<int, vector<int>> (vector stores set indices 0..99999)
  • Also create the inverse sets for each set. Using interval sets this takes only 10k * 2 * sizeof(int) space per set on average.

    dynamic_bitset<> union_set = ...; //union of individual sets (can be vector<bool>)
    vector<interval_set<int>> inverse_sets = ...; // numbers 1..999999 not contained in each set
    int64_t missing_count = 0;
    
    for(int n = 1; n < 1000000; ++n)
        // count the missing pairs whose first element is n
        if (union_set.count(n) == 0) {
            // all pairs are missing
            missing_count += (999999 - n);
        } else {
            // check which second elements are not present
            interval_set<int> missing_second_elements = interval_set<int>(n+1, 1000000);
            // iterate over all sets containing n
            for(int set_idx: sets_for_number.find(n)) {
               // operator&= is in-place intersection
               missing_second_elements &= inverse_sets[set_idx];
            }
            // counting the number of pairs (n, m) where m is a number
            // that is not present in any of the sets containing n
            for(auto interval: missing_second_elements)
                missing_count += interval.size()
        }
    }
    
0
votes

If it is possible, have a set of all numbers and remove each of the number when you insert to your array of set. This will have a O(n) space complexity.

Of course if you don't want to have high spec complexity, maybe you can have a range vector. For each element in the vector, you have a pair of numbers which are the start/end of a range.