How to choose the initial clusters for k-means from TF-IDF vectors

Question

I'm working on text clustering. I want to select specific documents (as vectors) to be the initial centroids for k-means.

I have created the TF-IDF vectors for my dataset using Mahout, and I would like to choose the initial clusters from those vectors.

Does anyone have an idea how I can specify the initial centroids in Mahout?

Yes, Mahout can select the centroids randomly or by using Canopy, but I would like to select them manually. — Darsh

Rajkumar Rajkumar · Accepted Answer · 2014-11-18T08:30:34

bin/mahout kmeans
-c input clusters directory
-k optional number of initial clusters to sample from input vectors

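If the -k argument is supplied, any clusters in the -c directory will be overwritten and k random points will be sampled from the input vectors to become the initial cluster centers. So to use centroids you picked yourself, leave out -k and put them in the -c directory before running the job.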

Reference: https://mahout.apache.org/users/clustering/k-means-clustering.html
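
As a rough sketch of what that looks like in practice, the script below assembles a kmeans invocation that leaves out -k, so whatever is already in the -c directory is used as the starting centroids. The three paths are hypothetical placeholders, and the cosine distance measure and the iteration cap are assumptions of this sketch (cosine is a common choice for TF-IDF), not something the answer specifies. It only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Hypothetical placeholder paths; replace them with your own locations.
INPUT="tfidf-vectors"        # TF-IDF vectors created earlier with Mahout
SEEDS="initial-clusters"     # must already hold the hand-picked centroids
OUTPUT="kmeans-output"

# Assumed settings, not taken from the answer.
MEASURE="org.apache.mahout.common.distance.CosineDistanceMeasure"
MAX_ITER=10

# -k is deliberately omitted: supplying it would overwrite SEEDS
# with randomly sampled points.
KMEANS_CMD="bin/mahout kmeans -i $INPUT -c $SEEDS -o $OUTPUT -dm $MEASURE -x $MAX_ITER -cl"

# Print the command for review; run it yourself once it looks right.
echo "$KMEANS_CMD"
```

Note that the -c directory has to be populated with your chosen vectors beforehand. As far as I understand, Mahout expects them there as a SequenceFile of clusters rather than as raw vectors, so check the reference above for the exact format before relying on this.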
