1
votes

Is it possible, as part of a Lucene query, to exclude from the results documents that have fewer than N terms, or that are smaller than a given size?

The full story: I have a Lucene index with many documents. Some of them are large, others are very small, possibly only a few words. I want to run some tests, but only on documents of reasonable size. How can I filter out small documents? Currently, I am fetching the term frequency vector of each hit and dropping documents with fewer than N terms:

BooleanQuery q = some query...
TopDocs top = indexSearcher.search(q, size);
Collection<Integer> docNums = collectDocNums(top);
Iterator<Integer> it = docNums.iterator();
while (it.hasNext()) {
  int candDocNum = it.next();
  TermFreqVector tfv =
    indexReader.getTermFreqVector(candDocNum, "field");
  // tfv is null when no term vector is stored for this document and field
  if (tfv == null || tfv.getTerms().length < N)
     it.remove();
}

Can this be done more efficiently, either by filtering in the query itself, or somehow batching the loop below it?

2 Answers

1
votes

A filter would probably be a reasonable implementation. It sounds like such a filter would be reused frequently while searching, so a caching filter would be worthwhile. I don't know of any standard filter that accomplishes this, but a custom one would work nicely.

I'd implement something like:

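// Lucene 3.x API. Important: wrap the filter in a CachingWrapperFilter for performance.
Filter filter = new CachingWrapperFilter(new CustomFilter());

public class CustomFilter extends Filter {
    @Override
    public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
        // FilteredDocIdSet needs an inner set to narrow down; start from all live documents.
        DocIdSet allDocs = new QueryWrapperFilter(new MatchAllDocsQuery()).getDocIdSet(reader);
        return new CustomSet(allDocs, reader);
    }
}

public class CustomSet extends FilteredDocIdSet {
    private final IndexReader reader;
    public CustomSet(DocIdSet innerSet, IndexReader reader) {
        super(innerSet);
        this.reader = reader;
    }

    @Override
    protected boolean match(int docid) {
        try {
            TermFreqVector tfv = reader.getTermFreqVector(docid, "field");
            // A null vector means no term vector is stored for this document and field.
            return tfv != null && tfv.getTerms().length >= N;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}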
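
To make the benefit of caching concrete, here is a minimal plain-Java sketch of what the cached filter amounts to: the set of eligible documents is computed once and then reused for every search. It deliberately uses no Lucene types. The class name, both method names, and the termCount lookup are hypothetical; termCount stands in for the per-document term vector lookup.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;
import java.util.function.IntUnaryOperator;

public class CachedSizeFilterSketch {
    /**
     * Builds the set of documents that have at least minTerms distinct terms.
     * termCount is a hypothetical stand-in for the term vector lookup.
     */
    public static BitSet buildEligible(int maxDoc, int minTerms, IntUnaryOperator termCount) {
        BitSet eligible = new BitSet(maxDoc);
        for (int doc = 0; doc < maxDoc; doc++) {
            if (termCount.applyAsInt(doc) >= minTerms) {
                eligible.set(doc);
            }
        }
        return eligible;
    }

    /** Keeps only the hits whose document number is in the precomputed set. */
    public static List<Integer> keepEligible(List<Integer> hits, BitSet eligible) {
        List<Integer> kept = new ArrayList<>();
        for (int doc : hits) {
            if (eligible.get(doc)) {
                kept.add(doc);
            }
        }
        return kept;
    }
}
```

The expensive pass over all documents happens once in buildEligible; each later search only pays for a bit lookup per hit, which is roughly the saving the caching wrapper is meant to give.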
0
votes

Have a look at PositiveScoresOnlyCollector: it only collects documents with a score > 0. You could probably write a similar collector of your own that only accepts documents with a score > X.

The above is, of course, only applicable if you can find some relationship between N and X. To my understanding these two things should correlate: the fewer terms are matched, the lower the score, and vice versa.

If you could define some min score threshold, this approach should be more efficient than the one you're currently using.
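
As a rough illustration of the wrapping idea, here is a plain-Java sketch of a threshold collector that forwards a hit only when its score exceeds a minimum. It is not Lucene code: ScoredHitSink, MinScoreSink and collectAbove are hypothetical names, and the threshold value is something you would have to determine for your own index.

```java
import java.util.ArrayList;
import java.util.List;

public class MinScoreSketch {
    /** Hypothetical stand-in for a collector callback; not a Lucene type. */
    public interface ScoredHitSink {
        void collect(int doc, float score);
    }

    /** Forwards a hit to the wrapped sink only if its score exceeds minScore. */
    public static final class MinScoreSink implements ScoredHitSink {
        private final ScoredHitSink delegate;
        private final float minScore;

        public MinScoreSink(ScoredHitSink delegate, float minScore) {
            this.delegate = delegate;
            this.minScore = minScore;
        }

        @Override
        public void collect(int doc, float score) {
            if (score > minScore) {
                delegate.collect(doc, score);
            }
        }
    }

    /** Runs (doc, score) pairs through the threshold and returns the surviving doc numbers. */
    public static List<Integer> collectAbove(int[] docs, float[] scores, float minScore) {
        List<Integer> kept = new ArrayList<>();
        ScoredHitSink sink = new MinScoreSink((doc, score) -> kept.add(doc), minScore);
        for (int i = 0; i < docs.length; i++) {
            sink.collect(docs[i], scores[i]);
        }
        return kept;
    }
}
```

A real version would wrap an actual Lucene collector in the same way PositiveScoresOnlyCollector does, with the comparison against 0 replaced by a comparison against X.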