4
votes

Let's say we have a Lucene index with a few documents indexed using StopAnalyzer.ENGLISH_STOP_WORDS_SET. A user issues two queries:

  • foo:bar
  • baz:"there is"

Let's assume that the first query yields some results because there are documents matching that query.

The second query yields 0 results. This is because when baz:"there is" is parsed, it ends up as a void query, since both there and is are stopwords (technically speaking, it is converted to an empty BooleanQuery with no clauses). So far so good.
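To illustrate why the phrase ends up with no terms, here is a minimal sketch in plain Java. It is a simplified model, not Lucene's analyzer code: the class name StopwordSketch, the method analyze, and the small stop-word list are all assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Simplified model of stop-word filtering; this is NOT Lucene's analyzer code.
final class StopwordSketch {

    // Illustrative subset of English stop words (an assumption, not the full Lucene set).
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("a", "an", "and", "is", "the", "there"));

    // Lower-cases the text, splits it on whitespace and drops every stop word.
    static List<String> analyze(String text, Set<String> stopWords) {
        List<String> terms = new ArrayList<String>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty() && !stopWords.contains(token)) {
                terms.add(token);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        // Both words are stop words, so nothing is left to search for.
        System.out.println(analyze("there is", STOP_WORDS));            // prints []
        // With an empty stop-word set the phrase keeps both terms.
        System.out.println(analyze("there is", new HashSet<String>())); // prints [there, is]
        // An ordinary term passes through unchanged.
        System.out.println(analyze("bar", STOP_WORDS));                 // prints [bar]
    }
}
```

In this model, a phrase whose analyzed term list is empty leaves the query with nothing to match on, which corresponds to the empty BooleanQuery described above.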

However, any of the following combined queries

  • +foo:bar +baz:"there is"
  • foo:bar AND baz:"there is"

behave exactly the same way as the query +foo:bar, that is, they bring back some results, even though the second ANDed part yields no results on its own.

One might argue that when ANDing, both conditions have to be met, but here they aren't.

It seems contradictory that an atomic query component has a different impact on the overall query depending on the context. Is there any logical explanation for this? Can this be addressed in any way, preferably without writing our own QueryAnalyzer? Can this be classified as a Lucene bug?

In case it makes any difference, the observed behaviour occurs under Lucene v3.0.2.

This question was also posted on the Lucene Java users mailing list; no answers have come in so far.

3 Answers

0
votes

I would suggest not using the StopAnalyzer if you want to be able to search for phrases like "there is". StopAnalyzer is essentially a lossy optimization method and unless you are indexing huge text documents it's probably not worth it.

0
votes

I think it is perfectly fine. You can imagine the result of an empty query being the whole document collection. However, this result is omitted for practical reasons. So basically you're ANDing with a superset, not an empty set.

Edit: You can think of it this way: additional keywords refine the result set. This makes the most sense when you take prefix search into account. The shorter your prefix is, the more matches there are. The most extreme case would be the empty query matching the whole document collection.
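The set view above can be sketched in plain Java. This is only a model of the reasoning, not Lucene's implementation: the class ConjunctionSketch, the method and, the use of null for an emptied clause, and the document ids are all hypothetical. An emptied clause is skipped, which gives the same result as intersecting with the whole collection; a query with no usable clause at all returns nothing, matching the "omitted for practical reasons" case.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Simplified set model of ANDed clauses; this is NOT how Lucene scores or executes queries.
final class ConjunctionSketch {

    // Intersects the result sets of all required clauses.
    // A null entry stands for a clause that stop-word removal left empty;
    // skipping it is equivalent to intersecting with the whole collection.
    static Set<Integer> and(Set<Integer> allDocs, List<Set<Integer>> clauses) {
        Set<Integer> result = new TreeSet<Integer>(allDocs);
        boolean anyUsableClause = false;
        for (Set<Integer> clause : clauses) {
            if (clause == null) {
                continue; // emptied clause: imposes no restriction
            }
            anyUsableClause = true;
            result.retainAll(clause);
        }
        // A query with no usable clause returns nothing rather than every document.
        return anyUsableClause ? result : new TreeSet<Integer>();
    }

    public static void main(String[] args) {
        // Hypothetical document ids.
        Set<Integer> allDocs = new TreeSet<Integer>(Arrays.asList(1, 2, 3, 4));
        Set<Integer> fooBar = new TreeSet<Integer>(Arrays.asList(1, 3));

        // foo:bar AND baz:"there is" behaves like foo:bar alone.
        System.out.println(and(allDocs, Arrays.asList(fooBar, null))); // prints [1, 3]
    }
}
```

Under this model the two observations in the question are consistent: the emptied phrase returns nothing when it is the only clause, but it does not narrow the result when combined with another required clause.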

0
votes

Erick Erickson from the Lucene mailing list answered one part of this question:

But imagine the impact of what you're requesting. If all stop words get removed, then no query would ever match yours. Which would be very counter-intuitive IMO. Your users have no clue that you've removed stopwords, so they'll sit there saying "Look, I KNOW that "bar" was in foo and I KNOW that "there is" was in baz, why the heck didn't this cursed system find my doc?"

So it looks like the only sensible way is to cease using stopwords or reduce the stopword set.