6
votes

I am new to text mining. When I run LDA() on a large dataset with 996165 observations, it fails with the following error:

Error in LDA(dtm, k, method = "Gibbs", control = list(nstart = nstart, : Each row of the input matrix needs to contain at least one non-zero entry.

I am pretty sure that there are no missing values in my corpus. The dtm object has classes "DocumentTermMatrix" and "simple_triplet_matrix", and tabulating is.na() on its first two components gives:

table(is.na(dtm[[1]]))
#FALSE 
#57100956 

table(is.na(dtm[[2]]))
#FALSE 
#57100956 

I am a little confused about where "57100956" comes from. Because my dataset is so large, I don't know how to find out why this error occurs. My LDA command is:

ldaOut<-LDA(dtm,k, method="Gibbs", control=list(nstart=nstart, seed = seed, best=best, burnin = burnin, iter = iter, thin=thin))

Can anyone provide some insights? Thanks.

2 Answers

4
votes

In my opinion the problem is not missing values but rows that are entirely zero. To check:

row.sum = apply(table, 1, FUN = sum)  # sum of each row of the table

Then you can delete every row that is all zeros with:

table = table[row.sum != 0, ]

Now table should contain only rows with at least one non-zero entry.
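On a matrix of this size, apply() may first convert the sparse matrix to a dense one, which can exhaust memory. A minimal sketch of the same check done sparsely, assuming the slam package is available (it is installed as a dependency of tm and topicmodels); the toy matrix and the names row_totals, empty_docs and dtm_clean are made up for illustration:

```r
library(slam)  # sparse-matrix helpers; a dependency of tm and topicmodels

# Toy stand-in for a DocumentTermMatrix: 4 documents x 3 terms,
# stored sparsely. Document 3 has no entries at all.
dtm <- simple_triplet_matrix(
  i = c(1, 1, 2, 4),   # row (document) index of each non-zero entry
  j = c(1, 3, 2, 1),   # column (term) index of each non-zero entry
  v = c(2, 1, 1, 5),   # the counts themselves
  nrow = 4, ncol = 3,
  dimnames = list(Docs = paste0("doc", 1:4), Terms = c("a", "b", "c"))
)

row_totals <- row_sums(dtm)            # per-document totals, no dense copy
empty_docs <- which(row_totals == 0)   # documents that would trigger the error
dtm_clean  <- dtm[row_totals > 0, ]    # keep only non-empty documents
```

The same two lines (row_sums() followed by the logical subset) should work on a real DocumentTermMatrix, since it is also a simple_triplet_matrix. Inspecting empty_docs shows which documents lost all their terms during preprocessing.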

1
votes

I had the same problem. The document-term matrix (dtm, in your case) had rows that were all zeroes because some documents did not contain any of the terms in the matrix (i.e. every term frequency in that row was zero). I suppose this somehow causes a singular matrix problem somewhere along the line. I fixed this by adding a common word to each of the documents so that every row would have at least one non-zero entry. At the very least, LDA then ran successfully and classified each of the documents. Hope this helps!
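A minimal sketch of that workaround in base R, assuming the raw documents are held in a character vector before the document-term matrix is built. The vector docs and the token "zzplaceholderzz" are hypothetical. Note that the placeholder becomes a term that occurs in every document, so it will show up in the fitted topics.

```r
# Hypothetical raw documents; the second one is empty after preprocessing.
docs <- c("topic models are fun", "", "gibbs sampling")

# Hypothetical placeholder token, chosen so it cannot collide with a real term.
placeholder <- "zzplaceholderzz"

# Append the token to every document so that none is empty.
docs_padded <- trimws(paste(docs, placeholder))

# Sanity check: every document now has at least one token.
token_counts <- lengths(strsplit(docs_padded, "\\s+"))
```

Building the document-term matrix from docs_padded then gives every row at least one non-zero entry.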