6
votes

my topic is similarity and clustering of (a bunch of) text(s). In a nutshell: I want to cluster collected texts together and they should appear in meaningful clusters at the end. To do this, my approach up to now is as follows, my problem is in the clustering. The current software is written in php.

1) Similarity: I treat every document as a "bag-of-words" and convert words into vectors. I use

  • filtering (only "real" words)
  • tokenization (split sentences into words)
  • stemming (reduce words to their base form; Porter's stemmer)
  • pruning (cut of words with too high & low frequency)

as methods for dimensionality reduction. After that, I'm using cosine similarity (as suggested / described on various sites on the web and here.

The result then is a similarity matrix like this:

        A   B   C   D   E 
    A   0  30  51  75  80
    B   X   0  21  55  70
    C   X   X   0  25  10
    D   X   X   X   0  15
    E   X   X   X   X   0

A…E are my texts and the number is the similarity in percent; the higher, the more similar the texts are. Because sim(A,B) == sim(B,A) only half of the matrix is filled in. So the similarity of Text A to Text D is 71%.

I want to generate a a priori unknown(!) number of clusters out of this matrix now. The clusters should represent the similar items (up to a certain stopp criterion) together.

I tried a basic implementation myself, which was basically like this (60% as a fixed similarity threshold)

    foreach article
      get similar entries where sim > 60
              foreach similar entry
              check if one of the entries already has a cluster number
              if no: assign new cluster number to all similar entries
              if yes: use that number

It worked (somehow), but wasn't good at all and the results were often monster-clusters. So, I want to redo this and already had a look into all kinds of clustering algorithms, but I'm still not sure which one will work best. I think it should be an agglomerative algoritm, because every pair of texts can be seen as a cluster in the beginning. But still the questions are what the stopp criterion is and if the algorithm should divide and / or merge existing clusters together.

Sorry if some of the stuff seems basic, but I am relatively new in this field. Thanks for the help.

3
Did you get any good answers? It's not even clear to me how many dimensions the clustering should work in...Jim

3 Answers

2
votes

Since you're both new to the field, have an unknown number of clusters and are already using cosine distance I would recommend the FLAME clustering algorithm.

It's intuitive, easy to implement, and has implementations in a large number of languages (not PHP though, largely because very few people use PHP for data science).

Not to mention, it's actually good enough to be used in research by a large number of people. If nothing else you can get an idea of what exactly the shortcomings are in this clustering algorithm that you want to address in moving onto another one.

1
votes

Just try some. There are so many clustering algorithms out there, nobody will know all of them. Plus, it also depends a lot on your data set and the clustering structure that is there. In the end, there also may be just this one monster cluster with respect to cosine distance and BofW features.

1
votes

Maybe you can transform your similarity matrix to a dissimilarity matrix such as transforming x to 1/x, then your problem is to cluster a dissimilarity matrix. I think the hierarchical cluster may work. These may help you:hierarchical clustering and Clustering a dissimilarity matrix