I'm curious as to how companies generally compute the cosine similarity quickly among an entire corpus. As an example, if someone searched for the terms "funny cats", and there are 100,000 documents that have at least one of those terms, calculating the cosine similarity on the fly between the query vector and those 100,000 document vectors can take a long time. Is there a general strategy for caching or speeding this search up?
1 Answer
First of all, computing cosine similarity among highly sparse vectors is fast: it is just a dot product of length-normalized vectors, and the cost is proportional to the number of nonzero terms the vectors share, not to the vocabulary size. 100,000 vectors in sparse format is truly nothing.
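To make that concrete, here is a minimal sketch of sparse cosine similarity, assuming each vector is a plain dict mapping term to weight (e.g., tf-idf); the example vectors are made up for illustration:

```python
import math

def cosine_similarity(a: dict, b: dict) -> float:
    # Iterate over the smaller vector, so the cost depends on its
    # number of nonzero terms, not on the vocabulary size.
    if len(a) > len(b):
        a, b = b, a
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

query = {"funny": 0.8, "cats": 0.6}
doc = {"funny": 0.5, "cats": 0.4, "videos": 0.3}
print(cosine_similarity(query, doc))
```

In practice the norms would be precomputed and stored per document, so each query only pays for the dot products.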
That said, there are many methods that can speed things up further by sacrificing some precision. One such method is LSH (Locality-Sensitive Hashing), where you very quickly "filter out" vectors that are not similar enough to be worth considering. In particular, you can build a search index based on random hyperplanes, which with high probability approximates cosine similarity. For more details refer to "Mining of Massive Datasets" by Leskovec, Rajaraman, and Ullman.
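Below is a rough sketch of the random-hyperplane idea, assuming dense NumPy vectors and made-up sizes: each hyperplane contributes one bit of a signature, and the fraction of agreeing bits between two signatures estimates the angle between the vectors:

```python
import numpy as np

rng = np.random.default_rng(42)
DIM, N_PLANES = 1000, 64                     # vocabulary size, signature length (arbitrary)
planes = rng.standard_normal((N_PLANES, DIM))  # one random hyperplane per bit

def signature(v: np.ndarray) -> np.ndarray:
    # One bit per hyperplane: which side of the plane the vector falls on.
    return planes @ v >= 0

def estimated_cosine(sig_a: np.ndarray, sig_b: np.ndarray) -> float:
    # For random hyperplanes, P[bits disagree] = theta / pi, so the
    # observed disagreement rate gives an estimate of the angle theta.
    theta = np.pi * np.mean(sig_a != sig_b)
    return float(np.cos(theta))

a = rng.standard_normal(DIM)
b = a + 0.3 * rng.standard_normal(DIM)       # a vector similar to a
true_cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"true cosine: {true_cos:.3f}, "
      f"LSH estimate: {estimated_cosine(signature(a), signature(b)):.3f}")
```

To build the actual "filter", you would bucket documents by (prefixes of) these bit signatures and compute exact cosine similarity only against the candidates that land in the same buckets as the query.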