I ran into this problem too. The cluster dumper writes out all of your cluster data, including every point along with its terms and weights. You have two choices:
- modify the ClusterDumper.printClusters() method so that it no longer prints all the terms and weights. I have some code like this:
    // one summary line per cluster instead of the full center vector
    String clusterInfo = String.format("Cluster %d (%d) with %d points.\n", value.getId(), clusterCount, value.getNumPoints());
    writer.write(clusterInfo);
    writer.write('\n');
    // list the top terms, if a dictionary was supplied
    if (dictionary != null) {
        String topTerms = getTopFeatures(value.getCenter(), dictionary, numTopFeatures);
        writer.write("\tTop Terms: ");
        writer.write(topTerms);
        writer.write('\n');
    }
    // list the points in the cluster: weight and name only, no term vectors
    List<WeightedVectorWritable> points = clusterIdToPoints.get(value.getId());
    if (points != null) {
        writer.write("\tCluster points:\n\t");
        for (Iterator<WeightedVectorWritable> iterator = points.iterator(); iterator.hasNext();) {
            WeightedVectorWritable point = iterator.next();
            writer.write(String.valueOf(point.getWeight()));
            writer.write(": ");
            if (point.getVector() instanceof NamedVector) {
                writer.write(((NamedVector) point.getVector()).getName() + " ");
            }
        }
        writer.write('\n');
    }
- if your output format allows it, post-process the dump with grep (or a similar filter) to strip out all the info about terms and weights.
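
If you would rather stay in Java than shell out to grep, the second option could look roughly like the sketch below. It assumes the term/weight data shows up in the dump as bracketed runs such as [term:0.5, other:0.2]; check that against your actual output first. DumpFilter and its methods are names I made up for illustration, they are not part of Mahout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Mahout: collapses bracketed vector data in dump lines.
final class DumpFilter {
    // Assumption: terms and weights appear between '[' and ']' with no nested brackets.
    private static final Pattern BRACKETED = Pattern.compile("\\[[^\\]]*\\]");

    // Replace every bracketed run in a single line with a short placeholder.
    static String stripVectors(String line) {
        return BRACKETED.matcher(line).replaceAll("[...]");
    }

    // Apply the same replacement to every line of a dump, preserving line order.
    static List<String> filter(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            out.add(stripVectors(line));
        }
        return out;
    }
}
```

Read the dump file line by line, pass each line through stripVectors(), and write the result to a new file. Lines without brackets come through unchanged, so cluster ids and point names are kept.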