Faster way of searching array of sets

Question

I have an array containing 100,000 sets. Each set contains natural numbers below 1,000,000. I have to find the number of ordered pairs {m, n}, where 0 < m < 1,000,000, 0 < n < 1,000,000 and m != n, which do not exist together in any of 100,000 sets. A naive method of searching through all the sets leads to 10^5 * (10^6 choose 2) number of searches.

For example, I have 2 sets: set1 = {1,2,4} and set2 = {1,3}. All possible ordered pairs of numbers below 5 are {1,2}, {1,3}, {1,4}, {2,3}, {2,4} and {3,4}. The ordered pairs of numbers below 5 which do not exist together in set1 are {1,3}, {2,3} and {3,4}. The ordered pairs below 5 missing from set2 are {1,2}, {1,4}, {2,3}, {2,4} and {3,4}. The ordered pairs which do not exist together in either of the sets are {2,3} and {3,4}. So the count of missing ordered pairs is 2.

Can anybody point me to a clever way of organizing my data structure so that finding the number of missing pairs is faster? I apologize in advance if this question has been asked before.

Update: Here is some information about the structure of my data set. The number of elements in each set varies from 2 to 500,000. The median number of elements is around 10,000. The distribution peaks around 10,000 and tapers off in both directions. The union of the elements in the 100,000 sets is close to 1,000,000.

Can we see your current way in a minimal reproducible example? A sample input with the desired output would be helpful to understand your situation as well. — Khalil Khalaf
Is there a limit to how many elements are in each set? Where did you get this problem from? Is it from a programming competition? Do you have a reason to believe this can be solved efficiently? — hugomg
Can you please elaborate about your expected output? Especially about {m, n}. I'm under an impression that you're looking for two numbers that don't exist in any sets. — Leben Asa
Check your example. 2 and 3 are present yet you say (2,3) is missing?? — David Thomas
@DavidThomas: As I understand it, the OP is looking for pairs which cannot be constructed from any set. From the first set {1,2,4}, we cannot make {1,3}, {2,3}, {3,4}. The second set only allows {1,3}, so the remaining pairs are {2,3}, {3,4} — Jarod42

Asad Saeeduddin · Accepted Answer · 2016-08-13T19:51:32

If you are looking for combinations across sets, there is a way to meaningfully condense your dataset, as shown in frenzykryger's answer. However, from your examples, what you're looking for is the number of combinations available within each set, meaning each set contains irreducible information. Additionally, you can't use combinatorics to simply obtain the number of combinations from each set either; you ultimately want to deduplicate combinations across all sets, so the actual combinations matter.

Knowing all this, it is difficult to think of any major breakthroughs you could make. Let's say you have i sets and a maximum of k items in each set. The naive approach would be:

If your sets are typically dense (i.e. contain most of the numbers between 1 and 1,000,000), replace them with the complement of the set instead.
Create a set of 2-tuples (use a set structure that ensures insertion is idempotent).
For each set, O(i):
- Evaluate all combinations and insert them into the set of combinations: O(k choose 2) per set.

The worst-case complexity for this isn't great, but assuming you have scenarios where a set either contains most of the numbers between 1 and 1,000,000, or almost none of them, you should see a big improvement in performance.
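The enumeration steps above can be sketched in Python roughly as follows. This is a minimal sketch rather than a tuned implementation: the function name `count_missing_pairs` and the `limit` parameter are made up for illustration, it assumes every element lies between 1 and `limit - 1`, and it leaves out the complement step for dense sets.

```python
from itertools import combinations
from math import comb

def count_missing_pairs(sets, limit):
    """Count pairs {m, n} with 0 < m < n < limit that share no set.

    Assumption: every element of every set is in 1 .. limit - 1.
    """
    covered = set()  # inserting the same tuple twice has no extra effect
    for s in sets:
        # sorted() gives each pair a canonical (smaller, larger) form
        covered.update(combinations(sorted(s), 2))
    total = comb(limit - 1, 2)  # all pairs drawn from 1 .. limit - 1
    return total - len(covered)

# The example from the question: numbers below 5, two sets.
example = [{1, 2, 4}, {1, 3}]
print(count_missing_pairs(example, 5))
```

On the question's example the covered pairs are (1,2), (1,4), (2,4) and (1,3), which leaves 2 of the 6 possible pairs missing. At the sizes described in the question the `covered` set would likely be far too large to hold in memory, so this only illustrates the logic.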

Another approach would be to go ahead and use combinatorics to count the number of combinations from each set, then use some efficient approach to find the number of duplicate combinations among sets. I'm not aware of such an approach, but it is possible it exists.
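To make the gap in this second idea concrete, here is a small Python sketch comparing the per-set combinatorial total with the number of distinct pairs. The helper name `pair_overcount` is hypothetical, and the distinct count is still found by brute-force enumeration here, so this shows what an efficient duplicate-counting method would need to compute, not how to compute it quickly.

```python
from itertools import combinations
from math import comb

def pair_overcount(sets):
    """Return (sum of per-set pair counts, distinct pairs, duplicates)."""
    # Pure combinatorics: C(|S|, 2) pairs per set, with no enumeration.
    per_set_total = sum(comb(len(s), 2) for s in sets)
    # Brute-force reference for the number of distinct pairs.
    distinct = set()
    for s in sets:
        distinct.update(combinations(sorted(s), 2))
    return per_set_total, len(distinct), per_set_total - len(distinct)

# {1, 2} appears in both sets, so the per-set total overcounts by one.
print(pair_overcount([{1, 2, 4}, {1, 2, 3}]))
```

The third value is the quantity that would have to be obtained without enumerating pairs for this approach to pay off.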
