I came up with a solution for programmatically creating a query that searches for a phrase with wildcards, using this code:

public static Query createPhraseQuery(String[] phraseWords, String field) {
    SpanQuery[] queryParts = new SpanQuery[phraseWords.length];
    for (int i = 0; i < phraseWords.length; i++) {
        WildcardQuery wildQuery = new WildcardQuery(new Term(field, phraseWords[i]));
        queryParts[i] = new SpanMultiTermQueryWrapper<WildcardQuery>(wildQuery);
    }
    return new SpanNearQuery(queryParts,       //words
                             0,                //max distance
                             true              //exact order
    );
}

Example of creating a query and printing it with toString():

String[] phraseWords = new String[]{"foo*", "b*r"};
Query phraseQuery = createPhraseQuery(phraseWords, "text");
System.out.println(phraseQuery.toString());

outputs:

spanNear([SpanMultiTermQueryWrapper(text:foo*), SpanMultiTermQueryWrapper(text:b*r)], 0, true)

This works well and is fast enough for most cases. For instance, if I build such a query and search with it, it returns the desired results, for example:

Sentence with foo bar.
Foolies beer drinkers.
...

And not something like:

Bar fooes.
Foo has bar.

I mentioned that the query is fast enough in most cases. My index is currently about 200 GB, and the average search time is between 0.1 and 3 seconds, depending on factors such as caching and the size of the document subsets matching each word in the phrase, since Lucene intersects the sets of documents found for each term.

Example: suppose I want to query the phrase "an* karenjin*" (which I split into ["an*", "karenjin*"] and then pass to the createPhraseQuery method), and I want it to match sentences containing "ana karenjina", "ani karenjinoj", "ane karenjine", ... (different grammatical cases in Croatian).
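A minimal sketch of the splitting step, assuming the phrase is split on whitespace only. The class and method names (PhraseSplitSketch, splitPhrase) are hypothetical and not part of my actual code; the resulting array is what would be handed to createPhraseQuery.

```java
import java.util.Arrays;

// Hypothetical helper, shown only to illustrate the splitting step.
class PhraseSplitSketch {

    // Split a phrase on runs of whitespace; wildcard characters stay inside each word.
    static String[] splitPhrase(String phrase) {
        String trimmed = phrase.trim();
        if (trimmed.isEmpty()) {
            return new String[0]; // avoid returning a single empty word
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitPhrase("an*   karenjin*")));
    }
}
```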

This query is so slow that I have never waited long enough to get results (over 1 hour), and it sometimes causes a "GC overhead limit exceeded" exception. This behaviour is somewhat expected, since "an*" by itself matches a huge number of documents. I am aware that I could query "an? karenjin*" instead, which gives results in 30-40 seconds (faster, but still slow).
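To make the difference between these patterns concrete, here is a toy sketch in plain Java. It is not Lucene code: a sorted set of strings stands in for the index's term dictionary, the word list is invented, and the names (WildcardExpansionSketch, toRegex, expand) are hypothetical. It only shows how many distinct terms each wildcard pattern expands to, which is one of the factors behind the cost.

```java
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Toy model only: a TreeSet stands in for the term dictionary of the "text" field.
class WildcardExpansionSketch {

    // Translate a wildcard pattern ('*' = any run of characters, '?' = exactly
    // one character) into a regular expression; all other characters are literal.
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') {
                sb.append(".*");
            } else if (c == '?') {
                sb.append('.');
            } else {
                sb.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.compile(sb.toString());
    }

    // Return every dictionary term that the whole wildcard pattern matches.
    static List<String> expand(TreeSet<String> terms, String wildcard) {
        Pattern p = toRegex(wildcard);
        return terms.stream()
                    .filter(t -> p.matcher(t).matches())
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Invented sample terms, assumed to be lower-cased at index time.
        TreeSet<String> terms = new TreeSet<>(Arrays.asList(
            "an", "ana", "ane", "ani", "anatomija", "andrija", "antena",
            "karenjina", "karenjine", "karenjinoj", "bar", "foo"));
        for (String pattern : new String[]{"an*", "an?", "karenjin*"}) {
            System.out.println(pattern + " -> " + expand(terms, pattern));
        }
    }
}
```

In a real index the gap is far larger: a short prefix such as "an*" can expand to a very large number of terms, while "an?" and "karenjin*" stay small.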

This is where I am confused. If I query just "karenjin*", I get results in 1 second. So I tried querying "an* karenjin*" with a Filter on "karenjin*", built from a WildcardQuery and a QueryWrapperFilter. It is still unacceptably slow (I killed the process before it returned anything).

The documentation says that a Filter reduces the search space of a Query. So I tried to use this filter:

Filter filter = new QueryWrapperFilter(new WildcardQuery(new Term("text", "karenjin*")));

And query:

Query query = createPhraseQuery(new String[]{"an*", "karenjin*"}, "text");

Then search (after several warm-up queries):

Sort sort = new Sort(new SortField("insertTime", SortField.Type.STRING, true));
TopDocs docs = searcher.search(query, filter, 100, sort);

OK, what is my question?

Why is querying with:

 Query query = new WildcardQuery(new Term("text", "karenjin*"));

fast, while the filtered phrase query described above is still slow?

1 Answer

Yes, wildcards can be performance hogs, especially if they match a lot of terms, but what you describe does seem surprisingly so. It is hard to say for sure why that is occurring, but here is an attempt.

I'll assume that:

Query query = new WildcardQuery(new Term("text", "an*"));

on its own performs very badly, as described. Since the wildcards you are looking for are both prefix-style queries, it's a better idea to use a PrefixQuery instead:

Query query = new PrefixQuery(new Term("text", "an"));

Though I don't think that will make much of a difference, if any at all. What might make a difference is changing your rewrite method. You could try limiting the number of terms the query is rewritten into:

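// Declared as PrefixQuery rather than Query, because setRewriteMethod is defined on MultiTermQuery
PrefixQuery query = new PrefixQuery(new Term("text", "an"));
//or
//WildcardQuery query = new WildcardQuery(new Term("text", "an*"));
// Keep only the 10 top-scoring terms when the query is rewritten
query.setRewriteMethod(new MultiTermQuery.TopTermsScoringBooleanQueryRewrite(10));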
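As a rough illustration of what capping the expansion buys you, here is a toy sketch in plain Java. It is not Lucene code: a sorted set stands in for the term dictionary, the sample terms are invented, and the names (CappedPrefixSketch, expandPrefix) are hypothetical. Lucene's top-terms rewrites pick which terms to keep by their own criteria; this sketch simply keeps the first ones in dictionary order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Toy model only: shows a prefix expansion that stops after maxTerms terms.
class CappedPrefixSketch {

    // Scan the sorted dictionary starting at the prefix and collect matching
    // terms until either the prefix no longer matches or the cap is reached.
    static List<String> expandPrefix(TreeSet<String> terms, String prefix, int maxTerms) {
        List<String> out = new ArrayList<>();
        for (String term : terms.tailSet(prefix, true)) {
            if (!term.startsWith(prefix) || out.size() >= maxTerms) {
                break;
            }
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        // Invented sample terms.
        TreeSet<String> terms = new TreeSet<>(Arrays.asList(
            "an", "ana", "ane", "ani", "anatomija", "andrija", "antena", "bar", "foo"));
        System.out.println("uncapped: " + expandPrefix(terms, "an", Integer.MAX_VALUE));
        System.out.println("capped:   " + expandPrefix(terms, "an", 3));
    }
}
```

The trade-off is visible even in the toy: the capped expansion does less work, but terms beyond the cap are dropped, so some documents that match the full prefix may no longer be found.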