0
votes

Some Background

I am working on a problem where I have sets stored in a hashmap with keys being the set name, i.e. Set1--> a,b,c,e,g .... Set2--> a,g,h,f ... Set3--> b,c,e ... etc.

The aim of the program is take a value from the user as a "threshold" i.e. 2, which is used a the minimum commonality between sets. if the threshold is met or exceeded, the program suggests a merge between the sets.

I have created a combination creator that will generate every possible combination between set names for comparison with order not considered i.e. (Set1, Set2,),(Set1,Set3),(Set2,Set3), (Set1,Set2,Set3).

These sets of combinations are then used to actually compare the sets. If the Threshold is met, this combination is store in a seperate list to output to the user as a possible merge. Before this is outputted, these is some logic to delete child combinations i.e. if (Set1,Set2,Set3) is a possible merge, then you can disregard, the other 3 child combinations as this super combination already covers it. we then output the suggested merges.

The Problem

When we reach a certain number of sets to compare i.e. above 17 let's say, we get an out of memory issue because there are millions of combinations being created. I would like your help on understanding alternative approaches or how we could improve this approach. It works but it's not efficient enough :(

Combination Creator

/**
 * Iterates through the setsToBeCompared ArrayList and gets all the combinations
 *
 * @return - ArrayList with all the possible combinations
 */
public ArrayList<String> generateCombinations(ArrayList<String> setsToBeCompared) {
    List<List<String>> temp = new ArrayList<>();
    ArrayList<String> a = new ArrayList<>();
    for (int i = 2; i <= 3; i++) {
        temp = calculateCombinations(setsToBeCompared, i);
        for (List<String> list : temp) {
            a.add(list.toString());
        }                       
    }
    return a;
        }

/**
 * Calculates all the combination given by the parameters
 *
 * @param values - the names of the sets to be compared
 * @param size   - where to start from
 * @return - List of all possible calculated combinations
 */
private List<List<String>> calculateCombinations(List<String> values, int size) {

    if (0 == size) {
        return Collections.singletonList(Collections.<String>emptyList());
    }

    if (values.isEmpty()) {
        return Collections.emptyList();
    }

    List<List<String>> combination = new LinkedList<List<String>>();

    String actual = values.iterator().next();
    List<String> subSet = new LinkedList<String>(values);
    subSet.remove(actual);
    List<List<String>> subSetCombination = calculateCombinations(subSet, size - 1);
    for (List<String> set : subSetCombination) {
        List<String> newSet = new LinkedList<String>(set);
        newSet.add(0, actual);
        combination.add(newSet);
    }

    combination.addAll(calculateCombinations(subSet, size));

    return combination;
}
2
Could you please be clearer about when exactly you want to merge sets? - J Fabian Meier
@JFMeier Sure... the program doesn't actually merge the sets it tells the user that these sets can be merged together as they have a commonality that meets the threshold they defined. This comparison between sets is done once we have created the combinations. The combinations are fed into the comparison algorithm. From this algorithm we get a new list of combinations that have met the threshold. - Zyzz
also, once we have this new list, we then condense the list down but getting rid of the child sets within the super sets of combinations. i.e. if Set1,Set2,Set3 can be merged, we can ignore the combinations Set1,Set2 etc - Zyzz
You might consider to not explicitly create the sets of all combinations. You could use an iterator instead. In these classes I implemented several iterators over different sorts of combinations (the PowerSetIterable might be the one you need). But note that the running time still grows exponentially. This will be a problem, maybe not with 17, but with 18 or 19 sets. Otherwise, you'll have to consider smarter approaches (and therefore, describe the problem setting in more detail) - Marco13
@SashaSalauyou that is true, but the reason I am creating all possible combinations is because if the largest subset does not meet the threshold very likely not impossible) then we should consider the others - Zyzz

2 Answers

0
votes

How about something like this (will use much less memory but you still need to examine large number of values - 2^N )

import static java.util.stream.IntStream.range;

public class Subsets implements Iterator<List<Integer>> {

    private final int level;
    private final LinkedList<List<Integer>> queue = new LinkedList<>();


    public Subsets(int level) {
        this.level = level;
        range(0, level).forEach(i -> queue.add(Arrays.asList(i)));
    }

    @Override
    public boolean hasNext() {
        return !queue.isEmpty();
    }

    public List<Integer> next() {
        List<Integer> list = queue.removeFirst();
        int maxValue = list.get(list.size() - 1);

        if(list.size() < level) {

            for (int k = maxValue+1; k < level; k++) {
                List<Integer> newList = new ArrayList<>(list);
                newList.add(k);
                queue.addFirst(newList);
            }
        }
        return list;
    }

    public static void main(String[] args) {
        Subsets s4 = new Subsets(4);
        while (s4.hasNext()) {
            System.err.println(s4.next());

        }
    }
}

To use this you will need to map the names of your sets (keys) to integers. Sample output:

[0]
[0, 3]
[0, 2]
[0, 2, 3]
[0, 1]
[0, 1, 3]
[0, 1, 2]
[0, 1, 2, 3]
[1]
[1, 3]
[1, 2]
[1, 2, 3]
[2]
[2, 3]
[3]
0
votes

So, summarizing points I posted as comments.

In your case, generating all subsets of sets is definitely not an option since number of such subsets will be ~2N. For N = 50 it is more than the Earth exists in nanoseconds.

I assume to switch from subsets of sets to subsets of their items. Say, M subsets have N distinct items, and merge threshold is T. So you need to try just ~NTk-combinations of size T looking which subsets could be merged through that combination of items, which for small T is acceptable.

Algorithm will be the following:

let D - collection of initial sets
let S - collection of distinct elements in sets across D

for each k-combination c over S {
   M = new M(c)          // merge object, which stores subset of merged sets and k-combination by which they are merged
   for each (s in D) {
      if (s.containsAll(c))
         M.sets.add(s)
   }
   if (M.sets.size > 0)  // some sets was merged through c
       merges.add(M)
} 

After that, taking all possible pairs of merges, remove those that are fully covered by other merges:

for each m in merges {
    for each m1 in merges {
        if (m.sets.containsAll(m1.sets))
            m1.markDeleted()
    }
}