0
votes

I'm looking for a solution that outputs a cluster of 2D points from an estimated distance matrix between some of them. Thing is, the distance between those points are not exact (an approximation) and there aren't values for all pair combinations.

I'm looking for any type of solutions that creates approximate coordinates that respect at most the distances provided.

1

1 Answers

0
votes

You could use t-SNE, an algorithm finding an embedding of potentially high-dimensional data into low-dimensional data based on the distances of the elements.

You will have to decide what to do with the missing distances, whether to set them at high values, or the mean distance, or something else.

Since t-SNE will preserve only local neighbourhoods, relations of distant clusters may be less accurate, but you'd have to see whether that is bad for your application.


Update: example for t-SNE

If you download the vanilla MATLAB implementation from the linked website and put that folder on the path, the following self-contained example should run.

%% Generate some random data
mu = [1 2];
Sigma = [1 0; 0 1];
R = chol(Sigma);
z1 = repmat(mu,100,1) + randn(100,2)*R;
mu = [5 1.5];
z2 = repmat(mu,100,1) + randn(100,2)*R;
mu = [3.5 6.5];
z3 = repmat(mu,100,1) + randn(100,2)*R;

%% Plot random data
figure(1);
clf
subplot(3, 1, 1)
scatter(z1(:,1), z1(:,2))
hold on
scatter(z2(:,1), z2(:,2))
scatter(z3(:,1), z3(:,2))
title('Original data')

%% Generate pw distance matrix and plot it
all_z = [z1; z2; z3];
% generate pw distance matrix
pwd = squareform(pdist(all_z));

subplot(3, 1, 2)
imagesc(pwd)
title('Distance matrix')

%% Perform tSNE
no_dims = 2;
perplexity = 150;
yd = tsne_d(pwd, [], no_dims, perplexity);
%% Plot results
subplot(3, 1, 3)
scatter(yd(1:100, 1), yd(1:100, 2))
hold on
scatter(yd((1:100) + 100, 1), yd((1:100) + 100, 2))
scatter(yd((1:100) + 200, 1), yd((1:100) + 200, 2))
title('tSNE embedding')

After sanitizing your distance matrix (i.e. assigning some value for the missing ones), you would start at %% Perform tSNE. perplexity is a parameter that should match approximately the number of points you expect in a cluster. In my example I chose 150, as I also want to consider neighbouring points a bit further away. The original paper has a good description of what the algorithm does.