3
votes

First of all, thanks for reading my question.

I computed TF/IDF weights and then calculated cosine similarity on those values to see which documents are similar to each other. The result is the matrix below. The columns are named doc1, doc2, doc3, and so on, and the rows have the same names. From the matrix I can see that doc1 and doc4 have 72% similarity (0.722711142). That is correct: when I read both documents, they are similar. I have 1000 documents, and I can look up each document's values in the matrix to see how many other documents are similar to it.

I tried different clustering methods, such as k-means and agnes (hierarchical), to group the documents, and they produced clusters. For example, Cluster1 contains (doc4, doc5, doc3) because their values (0.722711142, 0.602301766, 0.69912109) are close to one another. But when I check manually whether these 3 documents are really similar, they are NOT. :( What am I doing wrong, or should I use something other than clustering?

    1             0.067305859  -0.027552299   0.602301766   0.722711142    
    0.067305859   1             0.048492904   0.029151952  -0.034714695 
   -0.027552299   0.748492904   1             0.610617214   0.010912109    
    0.602301766   0.029151952  -0.061617214   1             0.034410392    
    0.722711142  -0.034714695   0.69912109    0.034410392   1            

P.S.: The values may be wrong; the matrix is only meant to give you an idea. If you have any questions, please ask. Thanks
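
For reference, here is a minimal sketch of the computation described above, written in plain Python. It is an assumption about the pipeline, not the exact code used: the tokenizer is a simple whitespace split, the weighting is relative term frequency times `log(N / df)`, and the function names and the three toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Weight = relative term frequency * log(N / df).
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def similarity_matrix(docs):
    """Pairwise cosine similarity; entry [i][j] compares docs i and j."""
    vectors = tfidf_vectors(docs)
    return [[cosine(u, v) for v in vectors] for u in vectors]

# Toy corpus (hypothetical): the first two documents share most terms.
docs = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stock prices fell sharply today",
]
sim = similarity_matrix(docs)
for row in sim:
    print(["%.3f" % value for value in row])
```

With non-negative weights like these, every entry lies between 0 and 1, the diagonal is 1, and the matrix is symmetric up to floating-point rounding.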

2
Any tips? Any help? - user238384
Minor question: can existing solutions to this problem not be applied, or why are you developing it from scratch? My feeling is that Lucene (or Solr) should have implemented this as well ... - Karussell
Well, I have already done what Lucene or Solr does. I now have a CSV file, but my question is different. If you can explain your question, I can answer it better. - user238384
Something seems amiss with the matrix. It has some weird non-symmetries to it. For your example cluster m[3,4] is -0.062 but m[4,3] is 0.611 and m[3,5] is 0.035 but m[5,3] is 0.699. - Geoff Reedy
GeoffReedy, please read my last line. I said I edited this matrix to give you an idea of what I want to do. The values may have problems. - user238384

2 Answers

1
votes

I'm not familiar with TF/IDF, but in general the process can go wrong at many stages:

1, Did you remove stopwords?

2, Did you apply stemming? Porter stemmer for example.

3, Did you normalize frequencies for document length? (Maybe TF/IDF already has a solution for that, I don't know.)

4, Clustering is a discovery method but not a holy grail. The documents it retrieves as a group may be related more or less, but that depends on the data, tuning, clustering algorithm, etc.
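
To make points 1 to 3 concrete, here is a rough sketch in plain Python. Everything in it is an assumption for illustration: the stop-word list is a tiny hypothetical one, and `crude_stem` is a naive suffix stripper standing in for a real stemmer (it is not the Porter algorithm).

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real one is much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "on", "in", "to"}

def crude_stem(word):
    """Strip a few common English suffixes. NOT the Porter algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, then stem."""
    tokens = [t.strip(".,;:!?()\"'") for t in text.lower().split()]
    return [crude_stem(t) for t in tokens if t and t not in STOPWORDS]

def relative_frequencies(tokens):
    """Term counts divided by document length (length normalization)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

short_doc = "The cats are jumping."
long_doc = "The cats are jumping. " * 10  # same content, ten times longer

print(preprocess(short_doc))
print(relative_frequencies(preprocess(short_doc)))
print(relative_frequencies(preprocess(long_doc)))
```

Because the counts are divided by the document length, the short and the long document end up with the same feature values, which is the effect point 3 asks about.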

What do you want to achieve? What is your setup? Good luck!

1
votes

My approach would be not to use pre-calculated similarity values at all, because the similarity between docs should be found by the clustering algorithm itself. I would simply set up a feature space with one column per term in the corpus, so that the number of columns equals the size of the vocabulary (minus stop words, if you want). Each feature value contains the relative frequency of the respective term in that document. I guess you could use tf*idf values as well, although I wouldn't expect that to help much. Depending on the clustering algorithm you use, the discriminating power of a particular term should be found automatically: if a term appears in all documents with a similar relative frequency, then that term does not discriminate well between the classes, and the algorithm should detect that.
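
A minimal sketch of that setup in plain Python, under some loud assumptions: the four toy documents are invented and already tokenized, the clustering is a bare-bones k-means (Lloyd's algorithm), and the starting centroids are passed in by hand to keep the run deterministic, whereas a real implementation would pick them randomly or with a smarter seeding scheme.

```python
from collections import Counter

def feature_matrix(tokenized_docs):
    """One row per document, one column per vocabulary term."""
    vocab = sorted({term for tokens in tokenized_docs for term in tokens})
    rows = []
    for tokens in tokenized_docs:
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero for empty docs
        # Feature value = relative frequency of the term in this document.
        rows.append([counts[term] / total for term in vocab])
    return vocab, rows

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(rows, initial_centroids, iterations=20):
    """Plain Lloyd's algorithm; returns one cluster label per row."""
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(rows)
    for _ in range(iterations):
        # Assignment step: attach each row to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(row, centroids[k]))
            for row in rows
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [row for row, label in zip(rows, labels) if label == k]
            if members:
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy corpus (hypothetical): two documents about pets, two about markets.
docs = [
    "cat dog cat pet".split(),
    "dog cat pet pet".split(),
    "stock market stock price".split(),
    "market price stock trade".split(),
]
vocab, rows = feature_matrix(docs)
labels = kmeans(rows, initial_centroids=[rows[0], rows[2]])
print(vocab)
print(labels)
```

The point of the sketch is that the clustering runs on the document-by-term rows directly, not on the rows of a precomputed document-by-document similarity matrix.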