Lucene - equivalent of SQL "IN" keyword

Question

Please excuse my novice question. I have tried searching for answers, but searching for this sort of thing is quite difficult given the keywords...

I am using Lucene 5.2.x to index a set of documents and each document has two fields: id and description.

I get a set of ids from a previous query in the system. Now I would like to get Lucene text search results on the description, but only from documents in that set of ids. Were I doing this (naively) in MySQL, I might do something like:

SELECT * FROM mytable 
    WHERE description LIKE 'blah%' 
          AND 
          id IN (6345, 5759, 333, ...)

The set of ids may be tens of thousands. What is the best approach to this with Lucene? Can I construct a Lucene query to handle this efficiently, or should I search my entire document index and then do a set intersection? Something else?

Thank you!

Denis Bazhenov · Accepted Answer · 2015-10-09T00:36:49

I would like to get Lucene text search results on the description but only from documents in the set of ids.

You need to use a BooleanQuery.

If you create the query using QueryParser, use:

+(id:6345 id:5759 id:333 ...) +description:blah*
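
With thousands of ids you would not type that string by hand. A minimal sketch of building it from a collection is below. The class name IdFilterQueryString and the method build are hypothetical, and the sketch assumes numeric ids and a prefix that contains no QueryParser special characters (nothing is escaped here). The prefix is left unquoted, because the classic query parser does not apply wildcards inside a quoted phrase.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class IdFilterQueryString {

    // Builds "+(id:1 id:2 ...) +description:prefix*" for the classic QueryParser.
    // Assumption: ids are numeric and descriptionPrefix needs no escaping.
    public static String build(List<Long> ids, String descriptionPrefix) {
        if (ids.isEmpty()) {
            // An empty group "+()" is not a useful query, so reject it early.
            throw new IllegalArgumentException("ids must not be empty");
        }
        // Each id becomes an optional clause; the group as a whole is required.
        String idClauses = ids.stream()
                .map(id -> "id:" + id)
                .collect(Collectors.joining(" "));
        return "+(" + idClauses + ") +description:" + descriptionPrefix + "*";
    }

    public static void main(String[] args) {
        System.out.println(build(Arrays.asList(6345L, 5759L, 333L), "blah"));
        // prints: +(id:6345 id:5759 id:333) +description:blah*
    }
}
```

The resulting string is still subject to the clause limit discussed further down, since the parser turns every id into one clause of a BooleanQuery.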

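If you create the query programmatically, the code will be something like:

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PrefixQuery;
import org.apache.lucene.search.TermQuery;

// Match any one of the ids (SHOULD clauses act as a logical OR).
BooleanQuery ids = new BooleanQuery();
ids.add(new TermQuery(new Term("id", "6345")), Occur.SHOULD);
ids.add(new TermQuery(new Term("id", "5759")), Occur.SHOULD);
ids.add(new TermQuery(new Term("id", "333")), Occur.SHOULD);

// Require both the description prefix and a hit in the id set.
BooleanQuery resultQuery = new BooleanQuery();
resultQuery.add(new PrefixQuery(new Term("description", "blah")), Occur.MUST);
resultQuery.add(ids, Occur.MUST);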

The set of ids may be tens of thousands.

BooleanQuery has a built-in limit on the number of clauses (see org.apache.lucene.search.BooleanQuery#maxClauseCount), which is 1024 by default. Adding more clauses than that throws BooleanQuery.TooManyClauses, so you will need to raise the limit with the static method BooleanQuery.setMaxClauseCount(). This will require you to create queries programmatically.

Can I construct a Lucene query to handle this efficiently, or should I search my entire document index and then do a set intersection? Something else?

As far as I know, an inverted index is the most efficient way of searching known at the moment, at least from a search-time perspective (leaving the indexing phase aside).

So, if efficiency is the concern, I recommend moving all the search logic into Lucene (which is an inverted index library). As a very mature library, Lucene can search over almost any kind of information. So all your documents can probably be indexed in Lucene, and the "previous queries" could also be executed in Lucene.

In that case there is no need to send thousands of ids to Lucene as an additional filter, which does indeed seem wasteful. Unless you have some unique search requirements, this is the most efficient way of searching I can come up with.
