Filter documents with less than N terms in Lucene query

Question

Is it possible, as part of a Lucene query, to exclude from the results documents that have fewer than N terms, or that are smaller than a given size?

The full story: I have a Lucene index with many documents. Some of them are large, others are very small, possibly only a few words. I want to run some tests, but only on documents of reasonable size. How can I filter out small documents? Currently, I am fetching the term frequency vector and dropping documents with fewer than N terms:

BooleanQuery q = ...; // some query
TopDocs top = indexSearcher.search(q, size);
Collection<Integer> docNums = collectDocNums(top);
Iterator<Integer> it = docNums.iterator();
while (it.hasNext()) {
  int candDocNum = it.next();
  TermFreqVector tfv =
    indexReader.getTermFreqVector(candDocNum, "field");
  // A null vector means no term vector is stored for this document.
  if (tfv == null || tfv.getTerms().length < N)
    it.remove();
}

Can this be done more efficiently, either by filtering in the query itself or by somehow batching the loop that follows it?

femtoRgon · Accepted Answer · 2012-12-19T20:53:26

A filter would probably be a reasonable implementation. It sounds like such a filter would be reused frequently while searching, so a caching filter would be worthwhile. I don't know of any standard filter that accomplishes this, but a custom one would work nicely.

I'd implement something like:

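// Important: wrap the filter in a CachingWrapperFilter, for performance.
Filter filter = new CachingWrapperFilter(new CustomFilter());

public class CustomFilter extends Filter {
    @Override
    public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
        // FilteredDocIdSet narrows an inner set, so start from every document in the reader.
        OpenBitSet allDocs = new OpenBitSet(reader.maxDoc());
        allDocs.set(0, reader.maxDoc());
        return new CustomSet(allDocs, reader);
    }
}

public class CustomSet extends FilteredDocIdSet {
    private final IndexReader reader;

    public CustomSet(DocIdSet innerSet, IndexReader reader) {
        super(innerSet);
        this.reader = reader;
    }

    @Override
    protected boolean match(int docid) throws IOException {
        // Term vectors must be stored for "field"; a null vector means none were stored.
        TermFreqVector tfv = reader.getTermFreqVector(docid, "field");
        return tfv != null && tfv.getTerms().length >= N;
    }
}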
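
To illustrate why a precomputed, cached set helps, here is a minimal sketch in plain Java that does not use Lucene at all. It assumes the per-document term counts are already available in an array; the class name MinTermsCacheSketch, the termCounts array, and both method names are hypothetical and exist only for this illustration. The idea is that the "at least N terms" check runs once per document when the set is built, and every later search only tests a bit.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class MinTermsCacheSketch {
    // Build the set of allowed document IDs once. termCounts[docId] is a
    // hypothetical stand-in for the number of terms in that document; in
    // Lucene this would come from the document's term vector.
    public static BitSet buildAllowed(int[] termCounts, int minTerms) {
        BitSet allowed = new BitSet(termCounts.length);
        for (int docId = 0; docId < termCounts.length; docId++) {
            if (termCounts[docId] >= minTerms) {
                allowed.set(docId);
            }
        }
        return allowed;
    }

    // Keep only the hits whose bit is set. No term count lookup happens here,
    // so the same BitSet can be reused across many searches.
    public static List<Integer> keepAllowed(int[] hits, BitSet allowed) {
        List<Integer> kept = new ArrayList<>();
        for (int docId : hits) {
            if (allowed.get(docId)) {
                kept.add(docId);
            }
        }
        return kept;
    }
}
```

In this sketch, buildAllowed would be called once per index snapshot and its result passed to keepAllowed for each search, which roughly mirrors the role a cached filter result plays relative to the per-hit loop in the question.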
