
My data exists in 128 dimensions. I'm trying to reduce it to 3 dimensions for visualization while preserving the Euclidean distances, since the distance represents the similarity between two data points.

Original data X: 5 * 128 (5 data points)

[[ -4.46e-02   1.57e-01   2.17e-01   1.24e-01   6.01e-02   7.61e-02
    6.38e-02  -1.05e-01  -2.55e-02   5.99e-02  -8.38e-02   5.93e-02
   -1.58e-01  -1.05e-01   1.31e-01  -5.33e-02  -4.18e-02   9.32e-02
   -1.62e-02  -9.19e-02  -1.30e-01   8.56e-02  -6.13e-02   3.78e-02
    7.84e-02  -9.74e-02  -9.42e-02   7.47e-02  -4.65e-02   7.36e-03
   -9.19e-04   1.37e-01  -8.52e-02   9.27e-02   6.50e-02  -2.61e-02
    7.21e-02  -1.83e-01  -2.49e-02  -9.85e-03   1.57e-01  -7.98e-02
    1.50e-01  -1.40e-01  -2.39e-02   4.19e-02   6.98e-02  -1.27e-02
   -7.56e-02   4.44e-02   1.86e-01  -2.22e-03  -1.79e-02  -3.90e-02
    7.72e-02   4.47e-02  -8.15e-02  -4.31e-02  -6.52e-03   7.73e-02
   -1.37e-02   5.78e-02  -1.25e-01  -1.58e-01   1.37e-01   9.34e-02
   -6.07e-03  -1.69e-01  -2.12e-01   2.14e-01  -4.05e-02   1.29e-01
    4.42e-02   1.71e-01  -2.13e-02   8.00e-03   7.17e-02   4.57e-03
   -6.55e-03  -1.66e-01   3.73e-02   1.01e-01  -1.26e-03   1.96e-02
    5.44e-02  -1.04e-01  -5.32e-02  -1.57e-02  -6.31e-02   1.89e-01
    2.43e-02   1.59e-02   9.13e-03  -4.41e-02  -5.96e-03   1.03e-01
    4.33e-02  -3.94e-02   7.85e-02   3.61e-02  -2.32e-02   3.69e-03
   -9.57e-03  -1.47e-02   2.61e-02  -4.15e-04   1.41e-02  -4.22e-02
   -7.42e-02   1.07e-01   9.08e-03   3.45e-02   6.41e-02  -5.37e-02
    1.57e-02  -1.91e-01   8.21e-02   3.31e-02   3.57e-02   1.37e-02
    1.56e-01   6.25e-02   4.54e-02  -1.07e-02   1.08e-01   2.69e-02
    9.57e-02  -1.24e-01]
...
]

Original distance matrix dist:

from scipy.spatial.distance import pdist, squareform

dist = DataArray(squareform(pdist(X, 'euclidean')))

[[ 0.  ,  0.67,  0.62,  0.7 ,  0.67],
 [ 0.67,  0.  ,  0.48,  0.76,  0.46],
 [ 0.62,  0.48,  0.  ,  0.7 ,  0.48],
 [ 0.7 ,  0.76,  0.7 ,  0.  ,  0.6 ],
 [ 0.67,  0.46,  0.48,  0.6 ,  0.  ]]

T-SNE:

from sklearn.manifold import TSNE

model = TSNE(n_components=3, random_state=0)
x_tsne = model.fit_transform(X)

x_tsne:

[[  1.78e-04   4.02e-05   1.01e-04]
 [  2.25e-04   1.90e-04  -1.00e-04]
 [  9.43e-05  -1.72e-05  -1.21e-05]
 [  4.02e-05   1.36e-05   1.49e-04]
 [  7.44e-05   1.08e-05   4.45e-05]]

dist_tsne:

[[  0.00e+00,   2.55e-04,   1.52e-04,   1.49e-04,   1.22e-04],
 [  2.55e-04,   0.00e+00,   2.60e-04,   3.57e-04,   2.75e-04],
 [  1.52e-04,   2.60e-04,   0.00e+00,   1.72e-04,   6.62e-05],
 [  1.49e-04,   3.57e-04,   1.72e-04,   0.00e+00,   1.10e-04],
 [  1.22e-04,   2.75e-04,   6.62e-05,   1.10e-04,   0.00e+00]]

When I compare dist and dist_tsne, I notice that the values are not the same, and they are not even proportional. How can I preserve the Euclidean distances while reducing the dimension?


1 Answer


That's theoretically not possible in general.

Your original data lives in many more dimensions, and you generally can't discard most of them while retaining all pairwise distances.

An example:

  • Imagine the 3 points of an equilateral triangle (in 2d-space)
    • Every pair of points has the same distance
  • Try to map this to a 1-dimensional sequence (number line)
    • It's not possible to keep the pairwise distances
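The triangle example can be checked numerically. The following is a minimal sketch using only the Python standard library (Python 3.8+ for `math.dist`); the helper `pairwise_distances` and the particular 1D placement are illustrative assumptions, not part of the question's code.

```python
import itertools
import math


def pairwise_distances(points):
    """Euclidean distance between every pair of points (tuples of coordinates)."""
    return [math.dist(p, q) for p, q in itertools.combinations(points, 2)]


# Equilateral triangle with side length 1 in 2D: all three distances are equal.
triangle = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
triangle_distances = pairwise_distances(triangle)

# Any placement on a number line: sort the three positions as a <= b <= c.
# Then (c - a) = (b - a) + (c - b), so the largest distance is the sum of the
# other two, and the three distances cannot all be equal (unless all are zero).
line = [(0.0,), (1.0,), (0.5,)]  # hypothetical 1D placement, for illustration only
line_distances = sorted(pairwise_distances(line))

print("2D distances:", triangle_distances)
print("1D distances:", line_distances)
```

Whatever three positions are tried on the line, the largest distance is always the sum of the two smaller ones, so at least one pairwise distance has to change.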

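The task of t-SNE and similar methods is to map these points to some lower-dimensional space while keeping the neighborhood structure visually meaningful (points that are close in the original space should stay close), so that we humans can grasp some information hidden in many dimensions. t-SNE does not try to reproduce the original distances or their proportions.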
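One rough way to quantify how far an embedding is from preserving distances is to look at the ratio between embedded and original distance for every pair of points. The sketch below uses the two matrices exactly as posted in the question (rounded values); the function name `distance_ratios` is a hypothetical helper, not a library call.

```python
import itertools

# Distance matrices copied from the question (rounded values as posted).
dist = [
    [0.00, 0.67, 0.62, 0.70, 0.67],
    [0.67, 0.00, 0.48, 0.76, 0.46],
    [0.62, 0.48, 0.00, 0.70, 0.48],
    [0.70, 0.76, 0.70, 0.00, 0.60],
    [0.67, 0.46, 0.48, 0.60, 0.00],
]
dist_tsne = [
    [0.00e+00, 2.55e-04, 1.52e-04, 1.49e-04, 1.22e-04],
    [2.55e-04, 0.00e+00, 2.60e-04, 3.57e-04, 2.75e-04],
    [1.52e-04, 2.60e-04, 0.00e+00, 1.72e-04, 6.62e-05],
    [1.49e-04, 3.57e-04, 1.72e-04, 0.00e+00, 1.10e-04],
    [1.22e-04, 2.75e-04, 6.62e-05, 1.10e-04, 0.00e+00],
]


def distance_ratios(original, embedded):
    """Ratio embedded / original for every pair of points i < j."""
    n = len(original)
    return {
        (i, j): embedded[i][j] / original[i][j]
        for i, j in itertools.combinations(range(n), 2)
    }


ratios = distance_ratios(dist, dist_tsne)

# If the embedding were a pure rescaling, every ratio would be identical
# and this spread would be 1.
spread = max(ratios.values()) / min(ratios.values())
print(f"largest ratio / smallest ratio = {spread:.2f}")
```

A spread close to 1 would mean the embedded distances are proportional to the original ones. For the matrices in the question the spread is well above 1, which matches the observation that the distances are not even proportional.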