3
votes

At the moment i am using the pdist function in Matlab, to calculate the euclidian distances between various points in a three dimensional cartesian system. I'm doing this because i want to know which point has the smallest average distance to all the other points (the medoid). The syntax for pdist looks like this:

% calculate distances between all points
distances = pdist(m);

But because pdist returns a one dimensional array of distances, there is no easy way to figure out which point has the smallest average distance (directly). Which is why i am using squareform and then calculating the smallest average distance, like so:

% convert found distances to matrix of distances
distanceMatrix = squareform(distances);

% find index of point with smallest average distance
[~,j] = min(mean(distanceMatrix,2));

The distances are averaged for each column, and the variable j is the index for the column (and the point) with the smallest average distance.

This works, but squareform takes a lot of time (this piece of code is repeated thousands of times), so i am looking for a way to optimise it. Does anyone know of a faster way to deduce the point with the smallest average distance from the results of pdist?

4
Do you have the same number of points for each call of your mediod function? - reve_etrange
Also, do you need the pairwise distances themselves? Or just the identity of the mediod? - reve_etrange
@reve_etrange, the number of points is different from time to time. And in the end all i need is the coordinates of the medoid (so i dont actually need the distances themselves, but use them to calculate the medoid). - user1218247
I experimented with precomputation of the indices. I think it can be faster if there are always the same number of points. Otherwise @yuk's answer is fastest AFAIK. I have nearly identical code in some of my functions. - reve_etrange

4 Answers

3
votes

I think for your task using SQUAREFORM function is the best way from vectorization view point. If you look at the content of this function by

edit squareform

You will see that it performs a lot of checks that take time of course. Since you know your input to squareform and can be sure it will work, you can create your custom function with just the core of squareform.

[r, c] = size(m);
distanceMatrix = zeros(r);
distanceMatrix(tril(true(r),-1)) = distances;
distanceMatrix = distanceMatrix + distanceMatrix';

Then run the same code as you did to find the medioid.

1
votes

Here's an implementation that doesn't require a call to squareform:

N1 = 10;
dim = 5;

% generate points
X = randn(N1, dim);

% find mean distance
for iter=N1:-1:1
    d_mean(iter) = mean(pdist2(X(iter,:),X([1:(iter-1) (iter+1):end],:),'euclidean'));
    % D(iter,:) = pdist2(X(iter,:),X([1:(iter-1) (iter+1):end],:),'euclidean');
end

[val ind] = min(d_mean);

But without knowing more about your problem, I have no idea if it would be faster.

If this is the lynchpin for your program's performance, you may need to consider other speedup options like mex.

Good luck.

1
votes

When pdist computes distances between pairs of observations (1,2,...,n), the distances are arranged in the following order:

(2,1), (3,1), ..., (m,1), (3,2), ..., (m,2), ..., (m,m–1))

To demonstrate this, try the following:

> X = [.2 .1 .7 .5]';
> D = pdist(X)
.1  .5  .3   .6  .4  .2

In this example, X stores n=4 observations. The result, D, is a vector of distances between observations (2,1), (3,1), (4,1), (3,2), (4,2), (5,4). This arrangement corresponds with the entries of the lower triangular part of the following n-by-n matrix:

M=
 0  0  0 0
.1  0  0 0
.5 .6  0 0
.3 .4 .2 0

Notice that D(1)=M(2,1), D(2)=(3,1) and so on. So, one way to get the pair of indices in M that correspond with D(k) would be to compute the linear index of D(k) in M. This could be done as follows:

% matrix size
n = 4;
% r(j) is the no. of elements in cols 1..j, belonging to the upper triangular part 
r = cumsum(1:n-1);       
% p(j) is the no. elements in cols 1..j, belonging to the lower triangular part 
p = cumsum(n-1:-1:1);
% The linear index of value D(k)
q = find(p >= k, 1);
% The subscript indices of value D(k)
[i j] = ind2sub([n n], k + r(q));

Notice that n, r and p need to be set only once. From that point, you can find the index for any given k using the last two lines. Let's check this:

for k = 1:6
   q = find(p >= k, 1);
   [i, j] = ind2sub([n n], k + r(q));
   fprintf('D(%d) is the distance between observations (%d %d)\n', k, i, j);
end

Here's the output:
D(1) is the distance between observations (2 1)
D(2) is the distance between observations (3 1)
D(3) is the distance between observations (4 1)
D(4) is the distance between observations (3 2)
D(5) is the distance between observations (4 2)
D(6) is the distance between observations (4 3)

0
votes

There is no need to use squareform:

distances = pdist(m);
l=length(distances);
n=(1+sqrt(1+4*l))/2;
m=[];
for i=1:n
  idx=[1+i:n:length(distances)];
  m(i)=mean(distances(idx));
end

j=min(m);

I am not sure, but maybe this can be vectorised as well, but now it is late.