Efficiently performing a bulk exact-match lookup in Lucene?

Question

tl;dr:

What's the best way to bulk-fetch documents from Lucene using an exact-match on a set of keys?

Long version:

We have a Lucene index persisted to disk that is read through a DirectoryReader.

It contains 2,000,000 documents with the schema:

{"key": "20-character-string", "value": "1-1000-character-string"}

We now need to perform the equivalent of SELECT document WHERE document.key IN $keyArray, i.e. return the subset of documents whose keys appear in $keyArray (a 10,000-item array of keys), using an exact match.

Is there a better way than performing 10,000 separate searches?

Lawrence Wagerfield · Accepted Answer · 2020-10-20T15:29:41

You should use TermInSetQuery.

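Under the hood (at least in the Lucene 8.x implementation), it rewrites to a BooleanQuery of TermQuery clauses ORed together if there are 16 or fewer terms in your set. For larger sets it uses something more efficient: it walks each segment's terms dictionary once for the whole sorted set of terms and collects the matching documents into a doc ID set, rather than running one query per key.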
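As a rough illustration, here is a minimal sketch of the bulk lookup, assuming Lucene 8.x on the classpath, that "key" was indexed as an untokenized StringField, that "value" is stored, and that keys are unique in the index. The class name BulkKeyLookup, the helper fetchByKeys and the tiny in-memory demo index are hypothetical and only there to make the example self-contained; in your case the searcher would wrap your existing DirectoryReader.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;
import org.apache.lucene.util.BytesRef;

public class BulkKeyLookup {

    // Hypothetical helper: one TermInSetQuery instead of one search per key.
    // Assumes each key occurs at most once in the index.
    static Map<String, String> fetchByKeys(IndexSearcher searcher, Collection<String> keys)
            throws IOException {
        List<BytesRef> terms = new ArrayList<>(keys.size());
        for (String key : keys) {
            terms.add(new BytesRef(key));
        }
        TermInSetQuery query = new TermInSetQuery("key", terms);

        // search() requires a positive hit count, hence Math.max.
        TopDocs hits = searcher.search(query, Math.max(1, keys.size()));

        Map<String, String> result = new HashMap<>();
        for (ScoreDoc hit : hits.scoreDocs) {
            // searcher.doc() is the Lucene 8.x way to load stored fields.
            Document doc = searcher.doc(hit.doc);
            result.put(doc.get("key"), doc.get("value"));
        }
        return result;
    }

    // Builds a three-document in-memory index and looks up two present keys
    // plus one missing key.
    static Map<String, String> demo() throws IOException {
        try (Directory dir = new ByteBuffersDirectory()) {
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
                for (String key : Arrays.asList("key-a", "key-b", "key-c")) {
                    Document doc = new Document();
                    doc.add(new StringField("key", key, Field.Store.YES));
                    doc.add(new StoredField("value", "value-of-" + key));
                    writer.addDocument(doc);
                }
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                return fetchByKeys(new IndexSearcher(reader),
                        Arrays.asList("key-a", "key-c", "key-missing"));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(demo());
    }
}
```

With 10,000 keys you would pass the whole array in a single call. Keys that are not in the index are simply absent from the returned map, so no per-key error handling is needed.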
