I'm analysing the k-means algorithm with Mahout. I'm going to run some tests, observe performance, and do some statistics with the results I get.
I can't figure out how to run my own program within Mahout. However, the command-line interface might be enough.
To run the sample program I do:
$ mahout seqdirectory --input uscensus --output uscensus-seq
$ mahout seq2sparse -i uscensus-seq -o uscensus-vec
$ mahout kmeans -i uscensus-vec/tfidf-vectors -o uscensus-kmeans-clusters -c uscensus-kmeans-centroids -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 5 -ow -cl -k 25
The dataset is one large CSV file. Each line is a record with comma-separated features, and the first field is an ID. Because of this input format I cannot use seqdirectory right away. I'm trying to implement the answer to this similar question, "How to perform k-means clustering in mahout with vector data stored as CSV?", but I still have two questions:
- How do I convert from CSV to a SequenceFile? I guess I can write my own program using Mahout to make this conversion and then use its output as input for seq2sparse. I guess I can use CSVIterator (https://cwiki.apache.org/confluence/display/MAHOUT/File+Format+Integrations). What class should I use to read and write?
- How do I build and run my new program? I couldn't figure it out from the book Mahout in Action or from other questions here.
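To make the first question concrete, here is a minimal sketch of the parsing half of such a converter. It assumes every field after the ID is numeric and that no field is quoted or contains a comma. The class and method names (CsvRecordParser, Record, parse) are hypothetical and not part of Mahout; only the Java standard library is used.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Mahout: splits one CSV record into an ID and a feature array. */
final class CsvRecordParser {

    /** One parsed record: the ID from the first field, the remaining fields as doubles. */
    static final class Record {
        final String id;
        final double[] features;

        Record(String id, double[] features) {
            this.id = id;
            this.features = features;
        }
    }

    /** Parses a line such as "42,1.0,2.5,3". Assumption: no quoted fields, no embedded commas. */
    static Record parse(String line) {
        String[] fields = line.split(",");
        if (fields.length < 2) {
            throw new IllegalArgumentException("Expected an ID and at least one feature: " + line);
        }
        double[] features = new double[fields.length - 1];
        for (int i = 1; i < fields.length; i++) {
            // Every field after the first is treated as a numeric feature.
            features[i - 1] = Double.parseDouble(fields[i].trim());
        }
        return new Record(fields[0].trim(), features);
    }

    public static void main(String[] args) {
        Record r = parse("42,1.0,2.5,3");
        System.out.println(r.id + " -> " + Arrays.toString(r.features)); // prints: 42 -> [1.0, 2.5, 3.0]
    }
}
```

Presumably each Record would then be wrapped in a Mahout VectorWritable and appended to a Hadoop SequenceFile keyed by the ID. That step is left out here because it needs the Hadoop and Mahout jars on the classpath, which is exactly the part I am unsure how to build and run.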