10
votes

I have a large set of points (n > 10000 in number) in some metric space (e.g. equipped with Jaccard Distance). I want to connect them with a minimal spanning tree, using the metric as the weight on the edges.

  • Is there an algorithm that runs in less than O(n^2) time?
  • If not, is there an algorithm that runs in less than O(n^2) average time (possibly using randomization)?
  • If not, is there an algorithm that runs in less than O(n^2) time and gives a good approximation of the minimum spanning tree?
  • If not, is there a reason why such an algorithm can't exist?

Thank you in advance!

Edit for the posters below: Classical algorithms for finding a minimal spanning tree don't work here. Their running time has a factor of E, the number of edges, but in my case E = Θ(n^2) since I'm effectively working with the complete graph. I also don't have enough memory to store all of the > 49995000 possible edges.

2
@Nikolai: of course I did. And many papers too. - Yakov Galka
You wouldn't need to "store" your 10^8 edges. You would need a bit vector to be able to mark visited edges, but this bit vector would only use 12 MB or so, which seems affordable as far as memory is concerned. - Sven Marnach
@Sven: a) 10000 vertices is a lower bound. b) Kruskal needs them to be stored and sorted. - Yakov Galka

2 Answers

7
votes

Apparently, according to this: Estimating the weight of metric minimum spanning trees in sublinear time, there is no deterministic o(n^2) algorithm (note: little-o, which is probably what you meant by "less than O(n^2)"). That paper does, however, give a sublinear randomized algorithm for estimating the weight of the metric minimum spanning tree.

Also look at this paper: An optimal minimum spanning tree algorithm, which gives a provably optimal algorithm. Interestingly, the paper also notes that the exact complexity of this optimal algorithm is not yet known!

The references in the first paper should be helpful and that paper is probably the most relevant to your question.

Hope that helps.

4
votes

When I was looking at a very similar problem 3-4 years ago, I could not find an ideal solution in the literature I looked at.

The trick, I think, is to find a "small" subset of "likely good" edges, which you can then run plain old Kruskal's algorithm on. In general, many MST edges can be found among the set of edges that join each vertex to its k nearest neighbours, for some small k. These edges might not span the graph, but when they don't, each component can be collapsed to a single vertex (chosen randomly) and the process repeated. (For better accuracy, instead of picking a single representative to become the new "supervertex", pick some small number r of representatives and in the next round examine all r^2 distances between two supervertices, choosing the minimum.)
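To make the idea concrete, here's a minimal Python sketch (function names are mine, not from any library). The brute-force nearest-neighbour search inside is only a placeholder — it's still O(n^2) as written, and the whole point is to swap it for a sublinear kNN index in practice:

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def approximate_mst(points, dist, k=3):
    """Approximate MST: repeatedly run Kruskal on the k-nearest-neighbour
    edges of the current component representatives, merging components
    until everything is connected. Returns a list of (weight, i, j) edges."""
    n = len(points)
    parent = list(range(n))

    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []
    while len(mst) < n - 1:
        # one representative per remaining component
        reps = [i for i in range(n) if find(i) == i]
        # candidate edges: each representative to its k nearest representatives
        # (brute force here; replace with a kNN index to stay subquadratic)
        candidates = []
        for a in reps:
            nearest = sorted((dist(points[a], points[b]), b)
                             for b in reps if b != a)[:k]
            candidates.extend((d, a, b) for d, b in nearest)
        # plain Kruskal on the small candidate set
        for d, a, b in sorted(candidates):
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[ra] = rb
                mst.append((d, a, b))
    return mst
```

Each round is guaranteed to merge at least two components (the shortest candidate edge always crosses a component boundary at the start of a round), so the loop terminates with a spanning tree. Small k (2 or 3) typically recovers most true MST edges; increasing k trades extra work for accuracy.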

k-nearest-neighbour algorithms are quite well-studied for the case where objects can be represented as vectors in a finite-dimensional Euclidean space, so if you can find a way to map your objects down to that (e.g. with multidimensional scaling) then you may have luck there. In particular, mapping down to 2D allows you to compute a Voronoi diagram, and MST edges will always be between adjacent faces. But from what little I've read, this approach doesn't always produce good-quality results.
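To illustrate the Voronoi observation: in 2D the Euclidean MST is a subgraph of the Delaunay triangulation (the dual of the Voronoi diagram), so Kruskal only has to consider the O(n) Delaunay edges instead of all n^2 pairs. A sketch using SciPy (assuming `scipy` is available; `emst_2d` is my own name):

```python
import numpy as np
from scipy.spatial import Delaunay

def emst_2d(points):
    """Exact Euclidean MST of 2-D points via the Delaunay triangulation.
    The EMST is a subgraph of the Delaunay graph, so Kruskal only needs
    its O(n) edges rather than all n^2 pairs."""
    pts = np.asarray(points, dtype=float)
    tri = Delaunay(pts)
    # collect the undirected edges of every Delaunay triangle
    edges = set()
    for simplex in tri.simplices:
        for i in range(3):
            a, b = sorted((int(simplex[i]), int(simplex[(i + 1) % 3])))
            edges.add((a, b))
    weighted = sorted((np.linalg.norm(pts[a] - pts[b]), a, b)
                      for a, b in edges)
    parent = list(range(len(pts)))

    def find(x):                      # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst = []                          # Kruskal on the Delaunay edges only
    for d, a, b in weighted:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            mst.append((d, a, b))
    return mst
```

This gives the exact MST for 2D Euclidean points in O(n log n) overall, but of course it only applies once you've embedded your objects in the plane, and the embedding itself distorts the metric.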

Otherwise, you may find clustering approaches useful: Clustering large datasets in arbitrary metric spaces is one of the few papers I found that explicitly deals with objects that are not necessarily finite-dimensional vectors in a Euclidean space, and which gives consideration to the possibility of computationally expensive distance functions.