Identify documents from the results of Mahout clustering

Question

I am using Mahout to cluster text documents indexed with Solr.

I used the "text" field of the documents to form vectors, then ran Mahout's k-means driver for clustering and the clusterdumper utility to dump the results.

I am having difficulty understanding the dumper's output. I can see the clusters that were formed and the term vectors in them, but how do I extract the documents from these clusters? I want the result to be the input documents, grouped by the cluster each one appears in.

I am also looking for an answer to this question. This discussion: lucidimagination.com/search/document/dab8c1f3c3addcfe/… seems to imply that this is an open issue, with a patch implemented in Mahout 0.5 here: issues.apache.org/jira/browse/MAHOUT-236. — user576993

ieugen ieugen · Accepted Answer · 2012-01-18T14:06:13

I also had this problem. The cluster dumper dumps all of your cluster data, including the points. You have two choices:

Modify the ClusterDumper.printClusters() method so that it does not print all the terms and weights. I have some code like:



    // Fragment from inside ClusterDumper.printClusters(); value, clusterCount, writer,
    // dictionary, numTopFeatures and clusterIdToPoints come from the surrounding ClusterDumper code.
    String clusterInfo = String.format("Cluster %d (%d) with %d points.\n",
            value.getId(), clusterCount, value.getNumPoints());
    writer.write(clusterInfo);
    writer.write('\n');

    // List the top terms only, instead of every term and weight.
    if (dictionary != null) {
        String topTerms = getTopFeatures(value.getCenter(), dictionary, numTopFeatures);
        writer.write("\tTop Terms: ");
        writer.write(topTerms);
        writer.write('\n');
    }

    // List all the points in the cluster.
    List<WeightedVectorWritable> points = clusterIdToPoints.get(value.getId());
    if (points != null) {
        writer.write("\tCluster points:\n\t");
        for (WeightedVectorWritable point : points) {
            writer.write(String.valueOf(point.getWeight()));
            writer.write(": ");

            // Only a NamedVector carries the document name.
            if (point.getVector() instanceof NamedVector) {
                writer.write(((NamedVector) point.getVector()).getName() + " ");
            }
        }
        writer.write('\n');
    }

Alternatively, do some grep magic, if possible, to eliminate all the info about terms and weights.
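If grep is awkward to use, the same filtering can be sketched in plain Java. This is only a rough illustration, not part of Mahout: the class name ClusterDumpFilter is made up, and the two line formats it looks for (a cluster header such as `VL-12{...}` or `CL-3{...}`, and a point line such as `1.0: docName = [term:weight, ...]`) are assumptions about what the dump looks like when it includes points with named vectors. Check them against your own dump and adjust the patterns if they differ.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: keeps only cluster ids and document names from a text dump.
class ClusterDumpFilter {
    // Assumed cluster header line, e.g. "VL-12{n=2 c=[...] r=[...]}".
    private static final Pattern CLUSTER =
            Pattern.compile("^\\s*((?:VL|CL)-\\d+)\\{.*");
    // Assumed point line, e.g. "\t1.0: doc1 = [term:0.5, ...]".
    private static final Pattern POINT =
            Pattern.compile("^\\s*\\d+(?:\\.\\d+)?: (.+?) = \\[.*");

    // Maps each cluster id to the document names listed under it, in dump order.
    static Map<String, List<String>> docsByCluster(Iterable<String> lines) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        List<String> current = null;
        for (String line : lines) {
            Matcher c = CLUSTER.matcher(line);
            if (c.matches()) {
                current = result.computeIfAbsent(c.group(1), k -> new ArrayList<>());
                continue;
            }
            Matcher p = POINT.matcher(line);
            if (current != null && p.matches()) {
                current.add(p.group(1));
            }
            // Every other line (top terms, weights, headings) is dropped.
        }
        return result;
    }

    // Reads a dump from standard input and prints one line per cluster.
    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(System.in, StandardCharsets.UTF_8))) {
            for (String line; (line = in.readLine()) != null; ) {
                lines.add(line);
            }
        }
        docsByCluster(lines).forEach(
                (cluster, docs) -> System.out.println(cluster + ": " + docs));
    }
}
```

Pipe the text dump into it on standard input. Documents only show up if the vectors in the dump carry names, which is the same NamedVector condition the modified printClusters() code relies on.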
