I have been using quanteda for the past couple of months and really enjoy the package. My question is: how many rows of a dfm can the textstat_simil function handle before creating the similarity matrix takes too long?
I have a search corpus containing 15 million documents. Each document is a short sentence containing anywhere from 5 to 10 words (the documents sometimes include some 3-4 digit numbers too). I have tokenized this search corpus using character bigrams and created a dfm from it.
I also have a second corpus, which I call the match corpus. It contains a couple of hundred documents of similar length, tokenized the same way, and I have created a dfm for it as well. The aim is to find, for each match corpus document, the closest matching document in the search corpus.
I build a combined dfm by rbinding the match dfm with the search dfm. The combined dfm has about 1580 unique tokens. I then run textstat_simil on it with the "cosine" method, "documents" as the margin, and, as a test for now, a selection of just one of the match corpus documents. Even with a single selected document, textstat_simil takes over 5 minutes to run.
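For reference, here is a minimal sketch of the setup on toy data. It is not my exact code: the texts and object names (`search_txt`, `match_txt`, `char_bigram_dfm`) are made up for illustration. It also assumes a recent quanteda, where textstat_simil lives in the quanteda.textstats package and, as far as I can tell, the single document is passed through `y` rather than `selection`.

```r
library(quanteda)
library(quanteda.textstats)  # assumption: quanteda >= 3.0, where textstat_simil() is here

# Toy stand-ins for the real corpora (hypothetical data)
search_txt <- c(s1 = "red apple 123 on the table",
                s2 = "green pear 4567 in the basket",
                s3 = "blue plum 89 under a chair")
match_txt  <- c(m1 = "red apple 123 on table")

# Character-bigram dfm: split into single characters, then join adjacent pairs
char_bigram_dfm <- function(txt) {
  toks <- tokens(txt, what = "character")
  dfm(tokens_ngrams(toks, n = 2, concatenator = ""))
}

search_dfm <- char_bigram_dfm(search_txt)
match_dfm  <- char_bigram_dfm(match_txt)

# rbind() on dfms pads features missing from either side with zeros
combined <- rbind(match_dfm, search_dfm)

# Cosine similarity of every row against the one match document
sim <- textstat_simil(combined, y = combined["m1", ],
                      margin = "documents", method = "cosine")

# Keep only the search documents and pick the closest one
scores <- as.matrix(sim)[, 1]
scores <- scores[docnames(search_dfm)]
best   <- which.max(scores)
print(names(best))
```

On the real data, `search_txt` would hold the 15 million sentences and `match_txt` the couple of hundred match documents.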
Is this sort of volume too much for this type of approach using quanteda?
Cheers, Sof