1
votes

I have been using quanteda for the past couple of months and really enjoy using the package. One question I have is how many rows of a dfm can the textstat_simil function handle before the time to create the similarity matrix becomes too long.

I have a search corpus containing 15 million documents. Each document is a short sentence containing anywhere from 5 to 10 words (the documents sometimes include some 3-4 digit numbers too). I have tokenized this search corpus using character bigrams and created a dfm from it.

I also have another corpus that I call the match corpus. It has a couple hundred documents of similar length, tokenized the same way, and I have created a dfm for it as well. The aim is to find the closest matching document from the search corpus for each of the match corpus documents.

A combined dfm is made by rbinding the match dfm with the search dfm. The number of unique tokens for the combined dfm is about 1580. I then run textstat_simil on this combined dfm using "cosine" method, "documents" as the margin, and the selection being just one of the match corpus documents for now to test. However, when I run textstat_simil it takes over 5 minutes to run.

Is this sort of volume too much for this type of approach using quanteda?

Cheers, Sof

1
I think you might have to show some example code for this to be at all clear, even to someone who is an expert in the subject matter. - user1531971

1 Answer

0
votes

In quanteda v1.3.13, we reprogrammed the function for computing cosine similarities so that it is more efficient in both memory and storage. However, it sounds like you are still trying to get a document-by-document distance matrix (excluding the diagonal) that will be (15000000^2 - 15000000)/2 ≈ 1.125e+14 cells in size. If you are able to get this to run at all, I'm very impressed with your machine!
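
To make the scale concrete, here is a quick back-of-the-envelope calculation in base R. The document count comes from the question; the 8 bytes per cell is an assumption (a dense matrix of doubles), so treat the storage figure as an order-of-magnitude estimate only.

```r
# Rough size of a full document-by-document similarity matrix.
n_docs  <- 15e6                         # documents in the search corpus (from the question)
n_pairs <- n_docs * (n_docs - 1) / 2    # unordered document pairs, diagonal excluded

# Assumption: every pair is stored densely as one 8-byte double.
bytes_dense <- n_pairs * 8

format(n_pairs, scientific = TRUE, digits = 4)   # number of cells
bytes_dense / 1e12                               # size in terabytes (roughly 900)
```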

For your target set of a couple hundred match documents, however, you can narrow this down by using the selection argument.
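
A minimal sketch of comparing only the match documents against the search documents follows. It rests on several assumptions: it targets quanteda 3.x or later, where the textstat_* functions live in the separate quanteda.textstats package and, as far as I know, the restriction is expressed by passing the smaller dfm as `y` rather than through `selection`; the toy texts, object names, and the `make_dfm()` helper are hypothetical. On v1.3.x the equivalent would be the `selection` argument on the combined dfm.

```r
library(quanteda)
library(quanteda.textstats)  # assumption: quanteda >= 3, where textstat_simil() lives here

# Hypothetical toy data standing in for the search and match corpora.
search_txt <- c(s1 = "red apple pie", s2 = "green apple tart", s3 = "blue cheese 1234")
match_txt  <- c(m1 = "red apple pies")

# Hypothetical helper: character-bigram dfm, as described in the question.
make_dfm <- function(txt) {
  toks <- tokens(txt, what = "character")
  dfm(tokens_ngrams(toks, n = 2, concatenator = ""))
}

search_dfm <- make_dfm(search_txt)
# Align the match dfm to the search dfm's feature set before comparing.
match_dfm <- dfm_match(make_dfm(match_txt), features = featnames(search_dfm))

# Similarities of every search document (rows) to each match document (columns).
sim <- as.matrix(textstat_simil(search_dfm, match_dfm,
                                margin = "documents", method = "cosine"))

# For each match document, the name of the closest search document.
best <- apply(sim, 2, function(col) rownames(sim)[which.max(col)])
best
```

The result has one column per match document, so its size grows with (search documents) x (match documents) rather than with the square of the search corpus.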

Also, look for the experimental textstat_proxy() function in v1.3.13, which we created for this sort of problem. You can specify a minimum distance below which a distance will not be recorded, and it returns the distance matrix as a sparse matrix object. This is still experimental because the unrecorded (sparse) values are not really zeroes, but will be treated as zeroes by any operations on the sparse matrix. (This violates some distance properties - see the discussion here.)
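
As a hedged illustration of the thresholding idea, the sketch below calls textstat_proxy() with its `min_proxy` argument on toy data. It assumes quanteda 3.x or later with quanteda.textstats installed; the function is experimental, so its arguments may differ between versions. The texts, the 0.5 threshold, and the `make_dfm()` helper are hypothetical.

```r
library(quanteda)
library(quanteda.textstats)  # assumption: quanteda >= 3

# Hypothetical toy data standing in for the search and match corpora.
search_txt <- c(s1 = "red apple pie", s2 = "green apple tart", s3 = "blue cheese 1234")
match_txt  <- c(m1 = "red apple pies")

# Hypothetical helper: character-bigram dfm, as described in the question.
make_dfm <- function(txt) {
  toks <- tokens(txt, what = "character")
  dfm(tokens_ngrams(toks, n = 2, concatenator = ""))
}

search_dfm <- make_dfm(search_txt)
match_dfm  <- dfm_match(make_dfm(match_txt), features = featnames(search_dfm))

# Keep only cosine values of at least 0.5; everything else is simply not stored.
prox <- textstat_proxy(search_dfm, match_dfm,
                       margin = "documents", method = "cosine",
                       min_proxy = 0.5)

# Converting to a dense matrix shows the unrecorded entries as 0,
# even though the true similarities there are small but nonzero.
as.matrix(prox)
```

This shows the caveat mentioned above: after thresholding, an entry of 0 means "below the cutoff", not "no similarity".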