I want to cluster multiple documents using Mahout. The clustering works fine but I have no idea how to find out which documents are located in each cluster.

I read that you can pass the option --namedVector when creating the sparse files, but where does it take the ID from, and how can I retrieve this ID after the clustering is completed?


Right now I am doing the following steps:

I have a directory with a file for each document. The files are in the following format with the ID of the document as filename:

filename: documentID.txt

[TITLE]

[CONTENT]

I create a sparse directory with namedVectors using:

./mahout seqdirectory -i tmp/es-out -o tmp/es-out-seqdir -c UTF-8 -chunk 64 -xm sequential
./mahout seq2sparse -i tmp/es-out-seqdir -o tmp/es-out-sparse --maxDFPercent 85 --namedVector

Then I can cluster the results and create a dump:

./mahout kmeans -i tmp/es-out-sparse/tfidf-vectors -c tmp/es-kmeans-clusters -o tmp/es-kmeans -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -x 10 -k 20 -ow --clustering
./mahout clusterdump -i tmp/es-kmeans/clusters-10-final -o tmp/clusterdump -d tmp/es-out-sparse/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -sp 0 --pointsDir tmp/es-kmeans/clusteredPoints

The dump looks like this:

:VL-190{n=1 c=[1:3.407, 110:6.193, 2007:3.736, about:1.762, according:2.948, account:3.507, acting:6.
  Top Terms: 
    epa                                     =>  13.471728324890137
    mountaintop                             =>  11.364262580871582
    mine                                    =>  10.942587852478027

  Weight : [props - optional]:  Point:

[...]
2 Answers

0
votes

k-means in Mahout is only a toy.

You can use it for howtos and tutorials, but for real use it is too slow, too limited, and too hard to use. (Also, k-means results are not half as good as people think... most of the time they are dogfood.)

Benchmark other tools, and you'll be surprised big time.

0
votes

I found a way. You can use the seqdumper to extract the cluster mapping:

./mahout seqdumper -i /tmp/es-kmeans/clusteredPoints/part-m-00000 -o /tmp/cluster-points.txt

Then you can use a regex to extract the mapping of the vector IDs to cluster IDs.
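The regex step could look roughly like the sketch below. It rests on an assumption about the dump format that I have not checked against every Mahout version: each seqdumper line is expected to look like "Key: <clusterID>: Value: ... <vectorName> = [term:weight, ...]", where the vector name is whatever --namedVector stored (with seqdirectory that is typically the file path, e.g. /documentID.txt). The function name parse_cluster_points is made up for this example, so adjust the pattern if your dump looks different.

```python
import re
from collections import defaultdict

# ASSUMED line shape (varies between Mahout versions, check your own dump):
#   Key: 190: Value: 1.0: /documentID.txt = [1:3.407, 110:6.193, ...]
# Group 1 is the cluster ID, group 2 is the vector name right before " = [".
LINE_RE = re.compile(r"Key:\s*(\d+):\s*Value:.*?(\S+)\s*=\s*\[")


def parse_cluster_points(lines):
    """Return {cluster_id: [vector_name, ...]} from seqdumper output lines."""
    clusters = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match:  # header/footer lines and unnamed vectors are skipped
            clusters[int(match.group(1))].append(match.group(2))
    return dict(clusters)
```

To use it on the file written by seqdumper, open /tmp/cluster-points.txt and pass the file object to parse_cluster_points; iterating over the returned dict gives the documents for each cluster ID. Lines that do not match the assumed pattern are silently ignored, so an empty result usually means the pattern needs adapting.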