3
votes

I have a rather large collection of n d-dimensional vectors with integer coordinates (d is about 50), except that in some cases the coordinates are a special sentinel "don't care" value, which I'll denote by *. I'm trying to find an efficient algorithm for merging together all vectors that compare equal to one another, where "equal" means "each pair of coordinates in the vectors match, assuming that * entries can match anything." For example, given these row vectors:

[* 1 2 * *]
[1 1 * 2 *]
[2 1 3 * 1]
[2 * 3 * *]
[1 * 3 4 *]

We'd break them into these clusters:

[* 1 2 * *], [1 1 * 2 *]
[2 1 3 * *], [2 * 3 * *]
[1 * 3 4 *]

The requirement is that all clusters need to have all vectors in them be pairwise equivalent. It's possible that there are many ways to cluster things together meeting these criteria, so for some loose definition of "not too many" there should be "not too many" clusters.

It's certainly possible to do this by comparing all vectors pairwise and pairing off ones that match, then repeating this until all the vectors are clustered. Another option (the one we're currently using) is to start with the first vector in its own cluster, then for each vector in order check if it matches any existing clusters or whether it needs to go into its own. These approaches are either quadratic or run in time proportional to the product of the number of vectors times the number of clusters, which for our application isn't fast enough.

Are there any efficient algorithms for solving this particular problem?

1
Do you have some information about the integer coordinates range ? also about what is "rather large" ?lemon
The number of vectors is in the hundreds of thousands and the range of the integer coordinates is roughly in the tens to hundreds of thousands.templatetypedef
One thing is unclear to me : V1 can match V2, V2 can match V3 but it is possible that V1 does not match V3. So, is there some choices to do here, for instance minimizing vector groups (and each vector belongs only to one group) ? Or do you want to have 2 groups in this example (V1, V2) and (V2, V3) ?lemon
Good question! Every vector needs to belong to exactly one group.templatetypedef
So the aim is to minimize the amount of groups, or just take it 'as it comes' ?lemon

1 Answers

2
votes

So it looks like this problem is NP-hard - and not only is it NP-hard, it's one of the NP-hard problems that's probably going to be really hard to get an approximate answer to.

To show this, here's a quick reduction from the chromatic number problem (given a graph, what's the minimum number of colors needed to color it?) to this problem. I'll illustrate with an example. Imagine we have this graph:

   A -- B -- C
   |    |    |
   D -- E -- F

I'm going to start with this graph and form a set of vectors with "don't cares" so that each cluster corresponds to a set of nodes with no edges running between them (an independent set). Overall, each cluster will represent a set of nodes that can be given the same color, so finding the minimal clustering corresponds to finding the chromatic number of the graph.

Since there are six nodes in the graph, each vector will have six components. We'll form one vector per node in the graph. For example, the vector for the node A will be

[ 0 1 * 1 * * ]

The columns of the vector, in order, refer to nodes A, B, C, D, E, and F. The 0 in the A column means "this is the vector for A." The 1's in the columns for B and D mean "there are edges from this node to nodes B and D." The *'s in the other columns mean "there aren't any edges running between this node and those other nodes." If we form the set of vectors for all the nodes in the graph, we'll get this:

     A B C D E F
A: [ 0 1 * 1 * * ]
B: [ 1 0 1 * 1 * ]
C: [ * 1 0 * * 1 ]
D: [ 1 * * 0 1 * ]
E: [ * 1 * 1 0 1 ]
F: [ * * 1 * 1 0 ]

Now, imagine trying to cluster these vectors. Notice that if you take any two adjacent nodes (say, B and E), their vectors don't match, because the B vector has a 0 in the B column and the E vector has a 1 in the B column. However, if you take any nodes that aren't adjacent, then they will match, since

  • the columns for the nodes themselves will have 0's in one vector and *'s in the other, since each vector is tagged with which node it is and there's no edge running between them, and
  • each other column in each vector is either a 1 or a *.

Overall, this means that a cluster of vectors corresponds to a collection of nodes that have no edges running between them, so finding a minimal clustering gives a coloring of the graph. In this case, notice that the original graph can be 2-colored by making A, C, and E the same color and B, D, and F another color. If we look at the vectors, they can be clustered into A, C, and E and B, D, and F:

     A B C D E F
A: [ 0 1 * 1 * * ]
C: [ * 1 0 * * 1 ]
E: [ * 1 * 1 0 1 ]

B: [ 1 0 1 * 1 * ]
D: [ 1 * * 0 1 * ]
F: [ * * 1 * 1 0 ]

More generally, given a graph with n nodes, form n vectors of this form, where each vector corresponds to a node, has a 0 in the column for that node, a 1 in the column for each adjacent node, and a * in all other columns. The minimum number of clusters then corresponds to the chromatic number of the original graph, and this reduction can be done in polynomial time.

The problem is that the problem of finding the minimum chromatic number of a graph is NP-hard and no approximation algorithm is known that gets even close to a constant factor approximation. As a result, there's likely to be no good approximation algorithm for the original problem I described unless P = NP.

It's possible that the special case of the problem I need to solve happens to be easier to approximate, so I might post a follow-up question with more specifics of what I'm trying to do.