
I'm using the k-means algorithm to cluster my data. I have 5,000 samples. (Each sample describes a customer; to analyse customer value, I'm clustering them based on 4 behavior features.) I have calculated the distance using both the Euclidean metric and Pearson correlation.

What I need to know: is Euclidean distance or Pearson correlation the correct method for calculating distances? I'm using the silhouette to validate my clustering. The silhouette value is higher when I use Pearson correlation than when I use the Euclidean metric. Does this mean that Pearson correlation is the more appropriate distance metric?


1 Answer


k-means does not support arbitrary distances.

It is based on variance minimization, which corresponds to (squared) Euclidean distance.
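To illustrate why variance minimization and squared Euclidean distance go together, here is a minimal sketch (not from the original answer). It checks numerically that the arithmetic mean of a cluster has a lower sum of squared Euclidean distances than randomly perturbed alternative centers. The random data and the helper name `sum_sq_euclidean` are assumptions made up for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical cluster: 50 samples with 4 features each (random placeholder data).
cluster = rng.normal(size=(50, 4))

def sum_sq_euclidean(center, points):
    """Sum of squared Euclidean distances from every point to `center`."""
    return float(((points - center) ** 2).sum())

mean = cluster.mean(axis=0)
cost_at_mean = sum_sq_euclidean(mean, cluster)

# Try 1000 other candidate centers scattered around the mean.
candidates = mean + rng.normal(scale=0.5, size=(1000, 4))
costs = np.array([sum_sq_euclidean(c, cluster) for c in candidates])

# True if no candidate beats the mean on this objective.
mean_is_best = bool((costs >= cost_at_mean).all())
print(mean_is_best)
```

This only demonstrates the property for squared Euclidean distance; it makes no such guarantee for other dissimilarities, which is why the mean update step in k-means is tied to this particular objective.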

With Pearson correlation, it will fail badly.

See this answer for an example of how k-means fails badly with Pearson:

https://stackoverflow.com/a/21335448/1060350

Short summary: the mean does not work for Pearson, but k-means is based on computing means. Instead, use PAM or a similar method that uses medoids.
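As a rough sketch of the medoid idea (not the full PAM swap procedure, and not code from the linked answer), the following uses a simple alternating k-medoids loop on a precomputed Pearson distance matrix. The dissimilarity `1 - correlation`, the function names, the random placeholder data, and the choice of `k=3` are all assumptions for illustration only.

```python
import numpy as np

def pearson_distance_matrix(X):
    """Pairwise dissimilarity 1 - Pearson correlation between the rows of X."""
    return 1.0 - np.corrcoef(X)

def k_medoids(D, k, n_iter=100, seed=0):
    """Simple alternating k-medoids on a precomputed distance matrix D.

    This is a simplified variant, not the full PAM swap search.
    """
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assign every sample to its nearest medoid.
        labels = D[:, medoids].argmin(axis=1)
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if members.size == 0:
                continue  # keep the old medoid if a cluster ends up empty
            # The new medoid is the member with the smallest total
            # distance to all other members of its cluster.
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[j] = members[within.argmin()]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = D[:, medoids].argmin(axis=1)
    return medoids, labels

# Placeholder data standing in for 200 customers with 4 behavior features.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))

D = pearson_distance_matrix(X)
medoids, labels = k_medoids(D, k=3)
```

Because each cluster center is an actual sample rather than a computed mean, this kind of method only needs pairwise dissimilarities, so it can work with Pearson-based distances where the k-means mean update cannot.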