2
votes

I'm trying to cluster similar documents using the R language. As a first step, I compute the term-document matrix for my set of documents. Then I create the latent semantic space from that term-document matrix. I decided to use LSA in my experiment because the results of clustering on the term-document matrix alone were awful. Is it possible to build a dissimilarity matrix (with the cosine measure) from the LSA space? I need this because the clustering algorithm I'm using requires a dissimilarity matrix as input.

Here is my code:

require(cluster);
require (lsa);

myMatrix = textmatrix("/home/user/DocmentsDirectory");
myLSAspace = lsa(myMatrix, dims=dimcalc_share());

I need to build a dissimilarity matrix (using cosine measure) from LSA space, so I can call the cluster algorithm as follows:

clusters = pam(dissimilarityMatrix,10,diss=TRUE);

Any suggestions?

Thanks in advance!


2 Answers

5
votes

To compare two documents in the LSA space, you can multiply the $sk and $dk components that lsa() returns (as the matrix product diag($sk) %*% t($dk)) to get every document as a column vector in the lower-dimensional LSA space. Here's what I did:

lsaSpace <- lsa(termDocMatrix)

# lsaMatrix now is a k x (num doc) matrix, in k-dimensional LSA space
lsaMatrix <- diag(lsaSpace$sk) %*% t(lsaSpace$dk)

# Use the `cosine` function in `lsa` package to get cosine similarities matrix
# (subtract from 1 to get dissimilarity matrix)
distMatrix <- 1 - cosine(lsaMatrix)

See http://en.wikipedia.org/wiki/Latent_semantic_analysis, where it says you can now use LSA results to "see how related documents j and q are in the low dimensional space by comparing the vectors sk*d_j and sk*d_q (typically by cosine similarity)."
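To connect this back to the pam() call in the question, here is a minimal end-to-end sketch. It makes several assumptions: the toy term-document matrix `tdm` is made up, base R's svd() is used as a stand-in for lsa() so the example runs without a document directory (u, d, v play the roles of $tk, $sk, $dk), and the cosine is computed by hand rather than with lsa's cosine(). Passing the result through as.dist() is a cautious choice, since pam() with diss=TRUE is documented to take the kind of object that dist() or daisy() produce.

```r
library(cluster)  # provides pam()

# Toy term-document matrix (terms in rows, documents in columns); hypothetical data.
tdm <- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 0,
    0, 0, 3, 1,
    0, 0, 1, 2,
    1, 0, 0, 1),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("term", 1:5), paste0("doc", 1:4))
)

# Truncated SVD as a stand-in for lsa(): u ~ $tk, d ~ $sk, v ~ $dk.
k <- 2
s <- svd(tdm, nu = k, nv = k)
docVectors <- diag(s$d[1:k]) %*% t(s$v)  # k x (num docs), one column per document
colnames(docVectors) <- colnames(tdm)

# Cosine similarity between columns, then convert to a dissimilarity.
norms <- sqrt(colSums(docVectors^2))
cosSim <- crossprod(docVectors) / (norms %o% norms)
distMatrix <- 1 - cosSim
distMatrix[distMatrix < 0] <- 0  # guard against tiny negative values from rounding

# as.dist() keeps the lower triangle and gives pam() a "dist" object.
clusters <- pam(as.dist(distMatrix), k = 2, diss = TRUE)
print(clusters$clustering)
```

With a real corpus, the same last two steps should apply to the distMatrix computed from the lsa() output above, with k set to the number of clusters wanted.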

2
votes

You can use the arules package. Here is an example:

 library(arules)
 dissimilarity(x=matrix(seq(1,10),ncol=2),method='cosine')
          1         2         3         4
2 -4.543479                              
3 -4.811989 -5.231234                    
4 -5.080052 -5.563952 -6.024433          
5 -5.343350 -5.885304 -6.395740 -6.877264
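As a sanity check, the same quantity can be computed by hand in base R. This is only a sketch under the assumption that cosine dissimilarity means 1 minus the cosine similarity between rows; the variable names are illustrative. Under that definition the values lie between 0 and 2, so if they differ from what dissimilarity() prints, the package may be using a different convention or version.

```r
# Same 5 x 2 matrix as in the arules example; rows are the items being compared.
x <- matrix(seq(1, 10), ncol = 2)

# Cosine similarity between rows: dot products divided by the product of row norms.
rowNorms <- sqrt(rowSums(x^2))
cosSim <- tcrossprod(x) / (rowNorms %o% rowNorms)

# Cosine dissimilarity as 1 - similarity, stored as a "dist" object
# (the same kind of object pam(..., diss = TRUE) accepts).
cosDiss <- as.dist(1 - cosSim)
print(cosDiss, digits = 3)
```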