I've run the following example:

https://github.com/technobium/mahout-clustering/blob/master/src/main/java/com/technobium/ClusteringDemo.java#L64

Document 1 -> John saw a red car.
Document 2 -> Marta found a red bike.
Document 3 -> Don need a blue coat.
Document 4 -> Mike bought a blue boat.
Document 5 -> Albert wants a blue dish.
Document 6 -> Lara likes blue glasses.
Document 7 -> Donna, do you have red apples?
Document 8 -> Sonia needs blue books.
Document 9 -> I like blue eyes.
Document 10 -> Arleen has a red carpet.

and it works as expected with EuclideanDistanceMeasure. But I'm not sure why the distance measures usually recommended for text (TanimotoDistanceMeasure and CosineDistanceMeasure) give me just a single cluster.

Why is this? I don't claim to understand these two distance measures, but what might I need to change to get better results? The driver calls take too many numeric parameters for me to understand the effect of each. I do have the book "Mahout in Action", though I have only read 2 chapters.

EuclideanDistanceMeasure (2 clusters - good)

 Clusters: 
         7 -> wt: 1.0 distance: 4.4960791719810365  vec: Document 1 = [8:2.609, 21:2.609, 29:1.693, 30:2.609]
         7 -> wt: 1.0 distance: 4.496079376645949  vec: Document 10 = [2:2.609, 9:2.609, 18:2.609, 29:1.693]
         7 -> wt: 1.0 distance: 4.496079576525459  vec: Document 2 = [3:2.609, 16:2.609, 25:2.609, 29:1.693]
         9 -> wt: 1.0 distance: 4.389955960700927  vec: Document 3 = [4:1.357, 10:2.609, 13:2.609, 27:2.609]
         9 -> wt: 1.0 distance: 4.389956011306051  vec: Document 4 = [4:1.357, 5:2.609, 7:2.609, 26:2.609]
         9 -> wt: 1.0 distance: 4.3899560687101395  vec: Document 5 = [0:2.609, 4:1.357, 11:2.609, 32:2.609]
         9 -> wt: 1.0 distance: 4.389956137136399  vec: Document 6 = [4:1.357, 17:2.609, 22:2.609, 24:2.609]
         7 -> wt: 1.0 distance: 5.577549042707083  vec: Document 7 = [1:2.609, 12:2.609, 14:2.609, 19:2.609, 29:1.693, 33:2.609]
         9 -> wt: 1.0 distance: 4.389956708176695  vec: Document 8 = [4:1.357, 6:2.609, 28:2.609, 31:2.609]
         9 -> wt: 1.0 distance: 4.389471924190491  vec: Document 9 = [4:1.357, 15:2.609, 20:2.609, 23:2.609]

produced by:

    CanopyDriver.run(new Path(vectorsFolder), new Path(canopyCentroids), new EuclideanDistanceMeasure(), 20, 5,
            true, 0, true);

    FuzzyKMeansDriver.run(new Path(vectorsFolder), new Path(canopyCentroids, "clusters-0-final"),
            new Path(clusterOutput), 0.01, 20, 2, true, true, 0, false);

CosineDistanceMeasure (just 1 cluster - bad)

Clusters: 
         0 -> wt: 1.0 distance: 0.6362357041216559  vec: Document 1 = [8:2.609, 21:2.609, 29:1.693, 30:2.609]
         0 -> wt: 1.0 distance: 0.6362357041216559  vec: Document 10 = [2:2.609, 9:2.609, 18:2.609, 29:1.693]
         0 -> wt: 1.0 distance: 0.636235704121656  vec: Document 2 = [3:2.609, 16:2.609, 25:2.609, 29:1.693]
         0 -> wt: 1.0 distance: 0.6328896123664868  vec: Document 3 = [4:1.357, 10:2.609, 13:2.609, 27:2.609]
         0 -> wt: 1.0 distance: 0.6328896123664868  vec: Document 4 = [4:1.357, 5:2.609, 7:2.609, 26:2.609]
         0 -> wt: 1.0 distance: 0.6328896123664868  vec: Document 5 = [0:2.609, 4:1.357, 11:2.609, 32:2.609]
         0 -> wt: 1.0 distance: 0.6328896123664868  vec: Document 6 = [4:1.357, 17:2.609, 22:2.609, 24:2.609]
         0 -> wt: 1.0 distance: 0.5876411474816594  vec: Document 7 = [1:2.609, 12:2.609, 14:2.609, 19:2.609, 29:1.693, 33:2.609]
         0 -> wt: 1.0 distance: 0.6328896123664868  vec: Document 8 = [4:1.357, 6:2.609, 28:2.609, 31:2.609]
         0 -> wt: 1.0 distance: 0.6328896123664868  vec: Document 9 = [4:1.357, 15:2.609, 20:2.609, 23:2.609]

produced by

    CanopyDriver.run(new Path(vectorsFolder), new Path(canopyCentroids), new CosineDistanceMeasure(), 20, 5,
            true, 0, true);

    FuzzyKMeansDriver.run(new Path(vectorsFolder), new Path(canopyCentroids, "clusters-0-final"),
            new Path(clusterOutput), 0.01, 20, 2, true, true, 0, false);

TanimotoDistanceMeasure (just 1 cluster - bad)

 Clusters: 
         0 -> wt: 1.0 distance: 0.8637279689324617  vec: Document 1 = [8:2.609, 21:2.609, 29:1.693, 30:2.609]
         0 -> wt: 1.0 distance: 0.8637279689324617  vec: Document 10 = [2:2.609, 9:2.609, 18:2.609, 29:1.693]
         0 -> wt: 1.0 distance: 0.8637279689324617  vec: Document 2 = [3:2.609, 16:2.609, 25:2.609, 29:1.693]
         0 -> wt: 1.0 distance: 0.8596377086023765  vec: Document 3 = [4:1.357, 10:2.609, 13:2.609, 27:2.609]
         0 -> wt: 1.0 distance: 0.8596377086023765  vec: Document 4 = [4:1.357, 5:2.609, 7:2.609, 26:2.609]
         0 -> wt: 1.0 distance: 0.8596377086023765  vec: Document 5 = [0:2.609, 4:1.357, 11:2.609, 32:2.609]
         0 -> wt: 1.0 distance: 0.8596377086023765  vec: Document 6 = [4:1.357, 17:2.609, 22:2.609, 24:2.609]
         0 -> wt: 1.0 distance: 0.8723755210900389  vec: Document 7 = [1:2.609, 12:2.609, 14:2.609, 19:2.609, 29:1.693, 33:2.609]
         0 -> wt: 1.0 distance: 0.8596377086023765  vec: Document 8 = [4:1.357, 6:2.609, 28:2.609, 31:2.609]
         0 -> wt: 1.0 distance: 0.8596377086023765  vec: Document 9 = [4:1.357, 15:2.609, 20:2.609, 23:2.609]

produced via

    CanopyDriver.run(new Path(vectorsFolder), new Path(canopyCentroids), new TanimotoDistanceMeasure(), 20, 5,
            true, 0, true);

    FuzzyKMeansDriver.run(new Path(vectorsFolder), new Path(canopyCentroids, "clusters-0-final"),
            new Path(clusterOutput), 0.01, 20, 2, true, true, 0, false);

In my opinion, on that toy data, 1 cluster is the better result. – Has QUIT--Anony-Mousse
I think it was happening even on my real data. Any suggestions what an 11th toy document might look like to get a 2nd cluster? – Sridhar Sarnobat
For all of these measures, you'll need much longer documents for them to work well. – Has QUIT--Anony-Mousse
Oh. I was trying with some bigger documents (1-3 non-prose paragraphs) and also got just one cluster. But thanks for the feedback, I'll play with the data sets and try to establish cause and effect. – Sridhar Sarnobat
@Anony-Mousse - thank you very much for the support, you turned out to be right in your first comment. If you copy and paste my answer as your own post I will mark it as the accepted answer so you get credit. – Sridhar Sarnobat

1 Answer

As Anony-Mousse said in his first comment, the data I fed it really does belong in a single cluster. After a few weeks of experimenting with the distance measure classes directly, I found a data set that results in more than one cluster:

1) Make sure the data is different enough

    Text id1 = new Text("Document 1");
    Text text1 = new Text("Atletico Madrid win");
    writer.append(id1, text1);

    Text id6 = new Text("Document 6");
    Text text6 = new Text("Both apple and orange are fruit");
    writer.append(id6, text6);

    Text id7 = new Text("Document 7");
    Text text7 = new Text("Both orange and apple are fruit");
    writer.append(id7, text7);

2) Determine good radius values

a) Experiment with the DistanceMeasure class with your sample data

    // toVector(...) is a helper that is not shown here
    Vector v1 = toVector("Atletico Madrid win");
    Vector v2 = toVector("Both apple and orange are fruit");
    Vector v3 = toVector("Both orange and apple are fruit");

    // Copy into a mutable list before handing it to the clusterer
    List<Vector> vectorList = new LinkedList<>(ImmutableList.of(v1, v2, v3));
    List<Canopy> canopies = CanopyClusterer.createCanopies(vectorList, new CosineDistanceMeasure(), 0.3, 0.3);
    for (Canopy canopy : canopies) {
        System.out.println("DistanceMeasureMain.main() " + canopy.asFormatString());
    }

produces:

    DistanceMeasureMain.main() distance is 0.19193857965451055
    DistanceMeasureMain.main() distance is 0.5281191379648771
    DistanceMeasureMain.main() distance is 0.19193857965451055
    DistanceMeasureMain.main() C0: {0:1.1,117724:1.0,378550445:1.0,1997849123:1.0}
    DistanceMeasureMain.main() C1: {0:1.1,96727:1.0,96852:1.0,2076577:1.0,93029210:1.0,97711124:1.0,1008851410:1.0}

b) Use the distances as your radius values

The t1 and t2 arguments to CanopyDriver.run() (0.2 and 0.2 here, versus 20 and 5 in the question) also seem to matter, though I don't fully understand the effect of every numeric parameter in the invocation below:

    // CosineDistanceMeasure
    CanopyDriver.run(new Path(vectorsFolder),
            new Path(canopyCentroids), new CosineDistanceMeasure(),
            0.2, 0.2, true, 1, true);

    FuzzyKMeansDriver.run(new Path(vectorsFolder), new Path(
            canopyCentroids, "clusters-0-final"), new Path(
            clusterOutput), 0.01, 20, 2, true, true, 0, false);

Output

Document 1 -> Atletico Madrid win
Document 6 -> Both apple and orange are fruit
Document 7 -> Both orange and apple are fruit

 Clusters: 
         0 -> wt: 1.0 distance: 0.0  vec: Document 1 = [1:1.405, 4:1.405, 6:1.405]
         1 -> wt: 1.0 distance: 0.0  vec: Document 6 = [0:1.000, 2:1.000, 3:1.000, 5:1.000]
         1 -> wt: 1.0 distance: 0.0  vec: Document 7 = [0:1.000, 2:1.000, 3:1.000, 5:1.000]
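
A possible explanation for why the original thresholds (20 and 5) collapsed everything into one cluster: for non-negative term weights, cosine and Tanimoto distances never exceed 1, so every document lies within a T2 of 5 of whichever point is picked first, and that first canopy absorbs all the others. The sketch below illustrates this with three of the vectors printed in the question. It is a simplified, plain-Java canopy pass written for illustration, not Mahout's implementation; the class and method names (CanopyThresholdSketch, countCanopies, sampleDocs) are made up, and it assumes Java 9+ for List.of and Map.of.

```java
import java.util.LinkedList;
import java.util.List;
import java.util.Map;

public class CanopyThresholdSketch {

    // Dot product of two sparse vectors stored as index -> weight maps.
    static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    // Cosine distance = 1 - cos(angle); with non-negative weights it lies in [0, 1].
    static double cosineDistance(Map<Integer, Double> a, Map<Integer, Double> b) {
        return 1.0 - dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b));
    }

    // Simplified canopy pass (assumption: only the T2 rule is modelled).
    // Take the first remaining point as a canopy center, then drop every
    // remaining point closer than t2 so it cannot start a canopy of its own.
    static int countCanopies(List<Map<Integer, Double>> points, double t2) {
        List<Map<Integer, Double>> remaining = new LinkedList<>(points);
        int canopies = 0;
        while (!remaining.isEmpty()) {
            Map<Integer, Double> center = remaining.remove(0);
            canopies++;
            remaining.removeIf(p -> cosineDistance(center, p) < t2);
        }
        return canopies;
    }

    // Three TF-IDF vectors copied from the output in the question.
    static List<Map<Integer, Double>> sampleDocs() {
        return List.of(
            Map.of(8, 2.609, 21, 2.609, 29, 1.693, 30, 2.609),   // Document 1
            Map.of(2, 2.609, 9, 2.609, 18, 2.609, 29, 1.693),    // Document 10
            Map.of(4, 1.357, 10, 2.609, 13, 2.609, 27, 2.609));  // Document 3
    }

    public static void main(String[] args) {
        List<Map<Integer, Double>> docs = sampleDocs();
        for (double t2 : new double[] {5.0, 0.9, 0.2}) {
            System.out.println("T2=" + t2 + " -> " + countCanopies(docs, t2) + " canopies");
        }
    }
}
```

With these three vectors, T2 = 5.0 yields 1 canopy, T2 = 0.9 yields 2 (the two "red" documents share one term and sit about 0.88 apart, while the "blue" document shares no terms with them and sits at exactly 1.0), and T2 = 0.2 yields 3. This suggests choosing T1 and T2 from the range of distances your measure actually produces on your data, rather than reusing values tuned for Euclidean distance.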