0
votes

I'm a little confused about online kmeans clustering. I know that it allows me to cluster with just one data at a time. But,is this all limited to one session? Suppose that I have a bunch of data clustered via this method and I get the clustered data result, would I be able to add more data to the cluster in the future?

I've also been looking for implementations of this code, and to no avail. Anyone know of any?

Update: To clarify more. Here is how my code works right now:

  1. Image is taken from live video feed, once enough pictures are saved, get kmeans of sift features.
  2. Repeat step 1, a new batch of live feed pictures, get kmeans again. Combine the kmeans vectors with the previous kmeans like :[A B]

You can see that this is bad, because I quickly get too much clusters, and each batch of clusters will definitely have overlaps with another batch.

What I want:

  1. Image taken from live video feed, once pics are saved, get kmeans
  2. Repeat step 1, get kmeans again, which updates and adds new clusters to the previous cluster.

Nothing that I've seen could accommodate that, unless I'm just not understanding them correctly.

1
@mschonaker preferably matlab, but i could live with C++mugetsu

1 Answers

1
votes

If you look at the original (!) publications, the method proposed by MacQueen - where the name k-means comes from - was in fact an online algorithm. I'm not sure if MacQueen did multiple passes over the data to improve the result. I believe he used a single pass, and objects would never be reassigned to a different cluster. If so, it was already an online algorithm!

Means are commonly computed as sum / count. This is not very sensible from a numerical point of view. E.g. in the classic Knuth book you can find a method for incrementally updating means. Wikipedia has it also.

Things get slightly more complicated once you actually want to reassign earlier points. But usually in a streaming context you do not know the previous points, so you cannot do that anyway.