Find a subset of k most distant point each other

Question

I have a set of N points (in particular this point are binary string) and for each of them I have a discrete metric (the Hamming distance) such that given two points, i and j, Dij is the distance between the i-th and the j-th point. I want to find a subset of k elements (with k < N of course) such that the distance between this k points is the maximum as possibile. In other words what I want is to find a sort of "border points" that cover the maximum area in the space of the points. If k = 2 the answer is trivial because I can try to search the two most distant element in the matrix of distances and these are the two points, but how I can generalize this question when k>2? Any suggest? It's a NP-hard problem? Thanks for the answer

Do you have a specific objective function whose codomain is totally ordered? (For example, do you know whether you'd prefer distances [1,3,5] or [2,3,4]?) — user380772
No I don't have a specific objective function: simply I want to choose this k points such that they are furthest each other. I originally think a Naive approach but unfortunately it doesn't work well: simply i just consider as start point the point that is most distant from the other, then i find the point that is most distant from this point and again, i compute the distance of the remaining points from this two point, take the mean, and choose the point that has the max avg distance from the two that i have already select, and this for all the remaining point until i reach the k. — Dario Alise

mcdowella mcdowella · Accepted Answer · 2017-07-14T04:40:51

One generalisation would be "find k points such that the minimum distance between any two of these k points is as large as possible".

Unfortunately, I think this is hard, because I think if you could do this efficiently you could find cliques efficiently. Suppose somebody gives you a matrix of distances and asks you to find a k-clique. Create another matrix with entries 1 where the original matrix had infinity, and entries 1000000 where the original matrix had any finite distance. Now a set of k points in the new matrix where the minimum distance between any two points in that set is 1000000 corresponds to a set of k points in the original matrix which were all connected to each other - a clique.

This construction does not take account of the fact that the points correspond to bit-vectors and the distance between them is the Hamming distance, but I think it can be extended to cope with this. To show that a program capable of solving the original problem can be used to find cliques I need to show that, given an adjacency matrix, I can construct a bit-vector for each point so that pairs of points connected in the graph, and so with 1 in the adjacency matrix, are at distance roughly A from each other, and pairs of points not connected in the graph are at distance B from each other, where A > B. Note that A could be quite close to B. In fact, the triangle inequality will force this to be the case. Once I have shown this, k points all at distance A from each other (and so with minimum distance A, and a sum of distances of k(k-1)A/2) will correspond to a clique, so a program finding such points will find cliques.

To do this I will use bit-vectors of length kn(n-1)/2, where k will grow with n, so the length of the bit-vectors could be as much as O(n^3). I can get away with this because this is still only polynomial in n. I will divide each bit-vector into n(n-1)/2 fields each of length k, where each field is responsible for representing the connection or lack of connection between two points. I claim that there is a set of bit-vectors of length k so that all of the distances between these k-long bit-vectors are roughly the same, except that two of them are closer together than the others. I also claim that there is a set of bit-vectors of length k so that all of the distances between them are roughly the same, except that two of them are further apart than the others. By choosing between these two different sets, and by allocating the nearer or further pair to the two points owning the current bit-field of the n(n-1)/2 fields within the bit-vector I can create a set of bit-vectors with the required pattern of distances.

I think these exist because I think there is a construction that creates such patterns with high probability. Create n random bit-vectors of length k. Any two such bit-vectors have an expected Hamming distance of k/2 with a variance of k/4 so a standard deviation of sqrt(k)/2. For large k we expect the different distances to be reasonably similar. To create within this set two points that are very close together, make one a copy of the other. To create two points that are very far apart, make one the not of the other (0s in one where the other has 1s and vice versa).

Given any two points their expected distance from each other will be (n(n-1)/2 - 1)k/2 + k (if they are supposed to be far apart) and (n(n-1)/2 -1)k/2 (if they are supposed to be close together) and I claim without proof that by making k large enough the expected difference will triumph over the random variability and I will get distances that are pretty much A and pretty much B as I require.

Find a subset of k most distant point each other

3 Answers