I am trying to determine document similarity between a single document and each of a large number of documents (n ~= 1 million) as quickly as possible. More specifically, the documents I'm comparing are e-mails; they are grouped (i.e., there are folders or tags) and I'd like to determine which group is most appropriate for a new e-mail. Fast performance is critical.
My a priori assumption is that the cosine similarity between term vectors is appropriate for this application; please comment on whether this is a good measure to use or not!
I have already taken into account the following possibilities for speeding up performance:
Pre-normalize all the term vectors
Calculate a term vector for each group (n ~= 10,000) rather than each e-mail (n ~= 1,000,000); this would probably be acceptable for my application, but if you can think of a reason not to do it, let me know!
I have a few questions:
If a new e-mail has a new term never before seen in any of the previous e-mails, does that mean I need to re-compute all of my term vectors? This seems expensive.
Is there some clever way to only consider vectors which are likely to be close to the query document?
Is there some way to be more frugal about the amount of memory I'm using for all these vectors?
Thanks!