Is it possible, as part of a Lucene query, to exclude from the results documents that have fewer than N terms, or that are smaller than a given size?
The full story: I have a Lucene index with many documents. Some of them are large, others are very small, possibly only a few words. I want to run some tests, but only on documents of reasonable size. How can I filter out the small documents? Currently, I am fetching the term frequency vector for each hit and dropping documents with fewer than N terms:
    BooleanQuery q = some query...
    TopDocs top = indexSearcher.search(q, size);
    Collection<Integer> docNums = collectDocNums(top);
    Iterator<Integer> it = docNums.iterator();
    while (it.hasNext()) {
        int candDocNum = it.next();
        TermFreqVector tfv =
            indexReader.getTermFreqVector(candDocNum, "field");
        if (tfv.getTerms().length < N)
            it.remove();
    }
Can this be done more efficiently, either by filtering within the query itself or by somehow batching the loop that follows it?