1
votes

I have about 200 points in the Cartesian plane (2D). I want to cluster these points into k clusters with respect to an arbitrary distance function (not necessarily a metric) and get the so-called centroids, or representatives, of these clusters. I know k-means does this for certain special distance functions such as Euclidean, Manhattan, and cosine. But k-means cannot handle an arbitrary distance function: for example, in the centroid-update phase of k-means with the Euclidean distance, the mean of the points in each cluster is the least-squares estimate and minimizes the sum of squared distances from the points in the cluster to their centroid (the mean); however, the mean may not minimize these distances when the distance function is arbitrary. Could you please help me with this and tell me about any clustering algorithms that can work here?

3
First, note that in the literature, a "distance" (metric) satisfies: (1) d(x,y) = d(y,x); (2) d(x,y) <= d(x,z) + d(z,y); (3) d(x,y) = 0 if and only if x = y. Is this correct in your case as well? – amit
Thanks for your consideration. No, actually (1) and (3) hold in our case, but not (2). There might be x, y, and z for which d(x,y) > d(x,z) + d(z,y). – user3314148

3 Answers

2
votes

If you replace "mean" with "most central point in cluster", then you get the k-medoids algorithm. Wikipedia claims that a metric is required, but I believe that to be incorrect, since I can't see where the majorization-minimization proof needs the triangle inequality or even symmetry.
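For concreteness, here is a minimal k-medoids sketch in Python (Voronoi-iteration style rather than the full PAM swap search). The `k_medoids` function and its parameters are illustrative names, not a library API, and note that nothing in it uses symmetry or the triangle inequality:

```python
import random

def k_medoids(points, k, dist, max_iter=100, seed=0):
    """Cluster `points` into k clusters under an arbitrary dissimilarity."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest medoid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, medoids[i]))
            clusters[nearest].append(p)
        # Update step: the new medoid is the cluster member minimizing the
        # sum of dissimilarities to the other members ("most central point").
        new_medoids = [
            min(c, key=lambda m: sum(dist(m, p) for p in c)) if c else old
            for c, old in zip(clusters, medoids)
        ]
        if new_medoids == medoids:  # converged: medoids stopped moving
            break
        medoids = new_medoids
    return medoids, clusters

# Example with a non-metric dissimilarity: squared Euclidean distance
# violates the triangle inequality, but k-medoids still applies.
pts = [(random.random(), random.random()) for _ in range(200)]
d = lambda a, b: (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
medoids, clusters = k_medoids(pts, k=5, dist=d)
```

Each returned medoid is an actual data point, which is usually what "representative" has to mean once the mean is no longer well defined for your distance.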

2
votes

There are various clustering algorithms that can work with arbitrary distance functions, in particular:

  • hierarchical clustering
  • k-medoids (PAM)
  • DBSCAN
  • OPTICS
  • many more; consult a good clustering book and/or software package (a sketch for the first two follows below)

But the only one of these that enforces k clusters and uses a "cluster representative" model is k-medoids. You may be putting too many constraints on the cluster model to have a wider choice.
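If you do want to try the other options, most of them can be driven from precomputed pairwise distances. A hedged sketch with SciPy, whose `pdist` accepts an arbitrary callable and whose `fcluster` can cut a hierarchical tree into exactly k clusters; the dissimilarity below is only a placeholder:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

points = np.random.rand(200, 2)           # ~200 points in the plane
d = lambda a, b: np.sum((a - b) ** 2)     # arbitrary (non-metric) dissimilarity

condensed = pdist(points, metric=d)       # condensed pairwise distance vector
Z = linkage(condensed, method='average')  # average linkage needs no geometry
labels = fcluster(Z, t=5, criterion='maxclust')  # force exactly k = 5 clusters
```

Hierarchical clustering gives you no representative by itself, but you can always pick the medoid of each resulting cluster afterwards.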

0
votes

Since you want something that represents a centroid but is not one of the data points, a technique I once used was to run something like k-medoids on N random samples, then take all the members of each cluster and use them as training examples to build a classifier that returns a class label. In the end, each class label returned by the classifier is an abstract notion of a cluster/centroid. I did this for a very specific and nuanced reason, and I know the flaws. If you don't want to have to specify k, and your vectors are not enormous and super sparse, then I would take a look at the Cobweb clustering in JavaML; JavaML also has a decent k-medoids.
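For what it's worth, here is a rough sketch of that sample-then-classify idea, assuming the optional scikit-learn-extra package for a k-medoids that accepts a precomputed distance matrix; the 1-nearest-neighbour classifier is just one possible stand-in for "build a classifier from the cluster members":

```python
import numpy as np
from sklearn_extra.cluster import KMedoids          # pip install scikit-learn-extra
from sklearn.neighbors import KNeighborsClassifier

points = np.random.rand(200, 2)
d = lambda a, b: np.sum((a - b) ** 2)               # arbitrary dissimilarity

# 1. Cluster a random sample with k-medoids on a precomputed distance matrix.
sample = points[np.random.choice(len(points), 50, replace=False)]
D = np.array([[d(a, b) for b in sample] for a in sample])
labels = KMedoids(n_clusters=5, metric='precomputed',
                  random_state=0).fit_predict(D)

# 2. Use the labeled sample to train a classifier; the class label it
#    predicts acts as an abstract "centroid" for each cluster.
clf = KNeighborsClassifier(n_neighbors=1).fit(sample, labels)
all_labels = clf.predict(points)
```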