So it looks like this problem is NP-hard - and not only is it NP-hard, it's one of the NP-hard problems that's probably going to be really hard to get an approximate answer to.
To show this, here's a quick reduction from the chromatic number problem (given a graph, what's the minimum number of colors needed to color it?) to this problem. I'll illustrate with an example. Imagine we have this graph:
A -- B -- C
| | |
D -- E -- F
I'm going to start with this graph and form a set of vectors with "don't cares" so that each cluster corresponds to a set of nodes with no edges running between them (an independent set). Overall, each cluster will represent a set of nodes that can be given the same color, so finding the minimal clustering corresponds to finding the chromatic number of the graph.
Since there are six nodes in the graph, each vector will have six components. We'll form one vector per node in the graph. For example, the vector for the node A will be
[ 0 1 * 1 * * ]
The columns of the vector, in order, refer to nodes A, B, C, D, E, and F. The 0 in the A column means "this is the vector for A." The 1's in the columns for B and D mean "there are edges from this node to nodes B and D." The *'s in the other columns mean "there aren't any edges running between this node and those other nodes." If we form the set of vectors for all the nodes in the graph, we'll get this:
A B C D E F
A: [ 0 1 * 1 * * ]
B: [ 1 0 1 * 1 * ]
C: [ * 1 0 * * 1 ]
D: [ 1 * * 0 1 * ]
E: [ * 1 * 1 0 1 ]
F: [ * * 1 * 1 0 ]
Now, imagine trying to cluster these vectors. Notice that if you take any two adjacent nodes (say, B and E), their vectors don't match, because the B vector has a 0 in the B column and the E vector has a 1 in the B column. However, if you take any nodes that aren't adjacent, then they will match, since
- the columns for the nodes themselves will have 0's in one vector and *'s in the other, since each vector is tagged with which node it is and there's no edge running between them, and
- each other column in each vector is either a 1 or a *.
Overall, this means that a cluster of vectors corresponds to a collection of nodes that have no edges running between them, so finding a minimal clustering gives a coloring of the graph. In this case, notice that the original graph can be 2-colored by making A, C, and E the same color and B, D, and F another color. If we look at the vectors, they can be clustered into A, C, and E and B, D, and F:
A B C D E F
A: [ 0 1 * 1 * * ]
C: [ * 1 0 * * 1 ]
E: [ * 1 * 1 0 1 ]
B: [ 1 0 1 * 1 * ]
D: [ 1 * * 0 1 * ]
F: [ * * 1 * 1 0 ]
More generally, given a graph with n nodes, form n vectors of this form, where each vector corresponds to a node, has a 0 in the column for that node, a 1 in the column for each adjacent node, and a * in all other columns. The minimum number of clusters then corresponds to the chromatic number of the original graph, and this reduction can be done in polynomial time.
The problem is that the problem of finding the minimum chromatic number of a graph is NP-hard and no approximation algorithm is known that gets even close to a constant factor approximation. As a result, there's likely to be no good approximation algorithm for the original problem I described unless P = NP.
It's possible that the special case of the problem I need to solve happens to be easier to approximate, so I might post a follow-up question with more specifics of what I'm trying to do.