10
votes

I have a large data set that I would like to cluster. My trial run set size is 2,500 objects; when I run it on the 'real deal' I will need to handle at least 20k objects.

These objects have a cosine similarity between them. This cosine similarity does not satisfy the requirements of being a mathematical distance metric; it doesn't satisfy the triangle inequality.

I would like to cluster them in some "natural" way that puts similar objects together without needing to specify beforehand the number of clusters I expect.

Does anyone know of an algorithm that would do that? Really, I'm just looking for any algorithm that doesn't require a) a distance metric and b) a pre-specified number of clusters.

Many thanks!

This question has been asked before here: Clustering from the cosine similarity values (but this solution only offers K-means clustering), and here: Effective clustering of a similarity matrix (but this solution was rather vague)

3
From en.wikipedia.org/wiki/Cosine_similarity: "Although the term "cosine similarity" has been used for this angular distance, the term is oddly used as the cosine of the angle is used only as a convenient mechanism for calculating the angle itself and is no part of the meaning. The advantage of the angular similarity coefficient is that, when used as a difference coefficient (by subtracting it from 1) the resulting function is a proper distance metric, which is not the case for the first meaning."phs
Thanks! Unfortunately I should have been more specific; I'm using a cosine-like similarity which I've defined myself. It does not satisfy the triangle inequality.user1473883

3 Answers

3
votes

Apache mahout has a number of clustering algorithms, including some which don't require you to specify N and which allow you to specify the distance metric.

Mean shift clustering is similar to k-means but without a pre specified number of clusters https://cwiki.apache.org/confluence/display/MAHOUT/Mean+Shift+Clustering.

Then more generally, if you would like to try a variety of algorithms, there is an absolute wealth of sophisticated packages available for R (including a few variational Bayesian implementations of EM which will select the best number of clusters) which have proved very useful for some of my research in the past: http://cran.r-project.org/web/views/Cluster.html.

2
votes

Actually most algorithms that require a "distance function" do not have the requirement for it to be metric.

DBSCAN can be generalized (see Wikipedia) to a version where it even abstracted from distance, it just needs to have some kind of "dense" notion. (DBSCAN also does not need to know the number of clusters beforehand)

But even for k-means - which has rather strict requirements on the distance, even beyond metrical - there is a variant called spherical k-means.

Anyway, in a database context, the full requirements of "metric" are utopic. In any real world data, there may be two records with the same coordinates, so at most you would have a pseudo-metric. The triangle inequality mostly plays a role for optimization (e.g. by using an M-tree index, which has strict triangle inequality requirements) or accelerated k-means exploiting this property.

2
votes

You could also try Affinity Propagation (http://www.psi.toronto.edu/index.php?q=affinity%20propagation). The algorithm takes a similarity matrix as input, and can also, I believe, automatically adjust the number of cluster centroids.