I am trying to find the point that is at a minimum distance from the candidate set. Z is a matrix where the rows are the dimension and columns indicate points. Computing the inter-point distances, and then recording the point with minimum distance and its distance as well. Below is the code snippet. The code works fine for a small dimension and small set of points. But, it takes a long time for large data set (N = 1 million data points and dimension is also high). Is there an efficient way?
1 Answers
I suggest that you use pdist
to do the heavy lifting for you. This function will compute the pairwise distance between every two points in your array. The resulting vector has to be put into matrix form using squareform
in order to find the minimal value for each pair:
N = 100;
Z = rand(2,N); % each column is a 2-dimensional point
% pdist assumes that the second index corresponds to dimensions
% so we need to transpose inside pdist()
distmatrix = squareform(pdist(Z.','euclidean')); % output is [N, N] in size
% set diagonal values to infinity to avoid getting 0 self-distance as minimum
distmatrix = distmatrix + diag(inf(1,size(distmatrix,1)));
mindists = min(distmatrix,[],2); % find the minimum for each row
sum_dist = sum(mindists); % sum of minimal distance between each pair of points
This computes every pair twice, but I think this is true for your original implementation.
The idea is that pdist
computes the pairwise distance between the columns of its input. So we put the transpose of Z
into pdist
. Since the full output is always a square matrix with zero diagonal, pdist
is implemented such that it only returns the values above the diagonal, in a vector. So a call to squareform
is needed to get the proper distance matrix. Then, the row-wise minimum of this matrix have to be found, but first we have to exclude the zero in the diagonals. I was lazy so I put inf
into the diagonals, to make sure that the minimum is elsewhere. In the end we just have to sum up the minimal distances.
pdist
. - Andras Deakpdist
will be an[N, N-1]
-sized matrix: each point's distance with respect to every other point. You simply need to find the minimum of this matrix for each row. In other words,min(pdist(Z,'euclidean'),[],2)
should be the minimal nearest-neighbour distance for each point, andsum()
ming this gives you what you need. Of course you need to check this to your proper implementation, but it should be very straightforward. - Andras Deaksum()
on that vector to sum it up..... - Andras Deaksquareform()
to get a proper matrix output frompdist
, and then we also have to handle the diagonal of the matrix to avoid finding the 0 distance of each point from itself. There's also a necessary transposition involved forpdist
to work. The problem is non-trivial enough that I did add an answer, please see below. - Andras Deak