Lucene: the same query parsed from a String and built via the Query API doesn't yield the same results

Question

I have the following code:

public static void main(String[] args) throws Throwable {
    String[] texts = new String[]{
            "starts_with k mer",
            "starts_with mer",
            "starts_with bleue est mer",
            "starts_with mer est bleue",
            "starts_with mer bla1 bla2 bla3 bla4 bla5",
            "starts_with bleue est la mer",
            "starts_with la mer est bleue",
            "starts_with la mer"
    };


    //write:
    Set<String> stopWords = new HashSet<String>();
    StandardAnalyzer stdAn = new StandardAnalyzer(Version.LUCENE_36, stopWords);
    Directory fsDir = FSDirectory.open(INDEX_DIR);
    IndexWriterConfig iwConf  = new IndexWriterConfig(Version.LUCENE_36,stdAn);
    iwConf.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    IndexWriter indexWriter = new IndexWriter(fsDir,iwConf);
    for(String text:texts) {
         Document document = new Document();
         document.add(new Field("title",text,Store.YES,Index.ANALYZED));
         indexWriter.addDocument(document);
    }
    indexWriter.commit();

    //read
    IndexReader indexReader = IndexReader.open(fsDir);
    IndexSearcher indexSearcher = new IndexSearcher(indexReader);

    //get query:
    //Query query = getQueryFromString("mer");
    Query query = getQueryFromAPI("mer");

    //explain
    System.out.println("======== Query: "+query+"\n");
    TopDocs hits = indexSearcher.search(query, 10);
    for (ScoreDoc scoreDoc : hits.scoreDocs) {
        Document doc = indexSearcher.doc(scoreDoc.doc);
        System.out.println(">>> "+doc.get("title"));
        System.out.println("Explain:");
        System.out.println(indexSearcher.explain(query, scoreDoc.doc));
    }
}

private static Query getQueryFromString(String searchString) throws Throwable {
    Set<String> stopWords = new HashSet<String>();
    Query query = new QueryParser(Version.LUCENE_36, "title",new StandardAnalyzer(Version.LUCENE_36, stopWords)).parse("("+searchString+") \"STARTS_WITH "+searchString+"\"");
    return query;
}

private static Query getQueryFromAPI(String searchString) throws Throwable {
    Set<String> stopWords = new HashSet<String>();
    Query searchStringTermsMatchTitle = new QueryParser(Version.LUCENE_36, "title", new StandardAnalyzer(Version.LUCENE_36, stopWords)).parse(searchString);

    PhraseQuery titleStartsWithSearchString = new PhraseQuery();
    titleStartsWithSearchString.add(new Term("title","STARTS_WITH".toLowerCase()+" "+searchString));
    BooleanQuery query = new BooleanQuery(true);

    BooleanClause matchClause = new BooleanClause(searchStringTermsMatchTitle, Occur.SHOULD);
    query.add(matchClause);     
    BooleanClause startsWithClause = new BooleanClause(titleStartsWithSearchString, Occur.SHOULD);
    query.add(startsWithClause);

    return query;
}

Basically, I'm indexing some strings, and I have two methods for creating a Lucene Query from user input: one builds the corresponding Lucene query string "manually" (via string concatenation) and parses it, and the other uses Lucene's API for building queries. They seem to build the same query, since the debug output shows the exact same query string, but the search results are not the same.

Running the query built via string concatenation yields (for the argument "mer"):

title:mer title:"starts_with mer"

and indeed, when I search with it, I get the documents that match the title:"starts_with mer" part first. Here's the explain output for the first result:

>>> starts_with mer
Explain:
1.2329358 = (MATCH) sum of:
  0.24658716 = (MATCH) weight(title:mer in 1), product of:
    0.4472136 = queryWeight(title:mer), product of:
      0.882217 = idf(docFreq=8, maxDocs=8)
      0.50692016 = queryNorm
    0.55138564 = (MATCH) fieldWeight(title:mer in 1), product of:
      1.0 = tf(termFreq(title:mer)=1)
      0.882217 = idf(docFreq=8, maxDocs=8)
      0.625 = fieldNorm(field=title, doc=1)
  0.9863486 = (MATCH) weight(title:"starts_with mer" in 1), product of:
    0.8944272 = queryWeight(title:"starts_with mer"), product of:
      1.764434 = idf(title: starts_with=8 mer=8)
      0.50692016 = queryNorm
    1.1027713 = fieldWeight(title:"starts_with mer" in 1), product of:
      1.0 = tf(phraseFreq=1.0)
      1.764434 = idf(title: starts_with=8 mer=8)
      0.625 = fieldNorm(field=title, doc=1)

Running the query built via the Lucene query API yields an apparently identical query:

title:mer title:"starts_with mer"

but this time the results are not the same, because the title:"starts_with mer" part is not matched at all. Here's the explain output for the first result:

>>> starts_with mer
Explain:
0.15185544 = (MATCH) sum of:
  0.15185544 = (MATCH) weight(title:mer in 1), product of:
    0.27540696 = queryWeight(title:mer), product of:
      0.882217 = idf(docFreq=8, maxDocs=8)
      0.312176 = queryNorm
    0.55138564 = (MATCH) fieldWeight(title:mer in 1), product of:
      1.0 = tf(termFreq(title:mer)=1)
      0.882217 = idf(docFreq=8, maxDocs=8)
      0.625 = fieldNorm(field=title, doc=1)

My question is: why don't I get the same results? I'd really like to be able to use the Query API here, especially since there's the BooleanQuery(disableCoord) option, which I'd like to use and don't know how to express directly in a Lucene query string. (Yes, my example passes "true" there; I've also tried "false", with the same result.)

===UPDATE

femtoRgon's answer is great: the problem was that I was adding the whole search string as a term, instead of first splitting it into terms and then adding each one to the query.

The answer femtoRgon gives works fine if the input string consists of a single term: in that case, adding the "STARTS_WITH" text as one term and then adding the search string as a second term works.

However, if the user inputs something that would be tokenized into more than one term, you have to first split it into terms (preferably using the same analyzer and/or tokenizer that you used when indexing, to get consistent results) and then add each term to the query.

What I ended up doing is making a function that splits the query string into terms, using the same analyzer that I used for indexing:

private static List<String> getTerms(String text) throws Throwable {
    Analyzer analyzer = getAnalyzer();      
    StringReader textReader = new StringReader(text);
    TokenStream tokenStream = analyzer.tokenStream(FIELD_NAME_TITLE, textReader);
    tokenStream.reset();        
    List<String> terms = new ArrayList<String>();
    CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
    while (tokenStream.incrementToken()) {
        String term = charTermAttribute.toString();
        terms.add(term);
    }
    tokenStream.end(); // end-of-stream step of the TokenStream workflow, before close()
    textReader.close();
    tokenStream.close();
    analyzer.close();       
    return terms;
}

Then I first add the "STARTS_WITH" as one term, and then each of the elements in the list as a separate term:

PhraseQuery titleStartsWithSearchString = new PhraseQuery();
titleStartsWithSearchString.add(new Term("title","STARTS_WITH".toLowerCase()));
for(String term:getTerms(searchString)) {
    titleStartsWithSearchString.add(new Term("title",term));
}
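
To illustrate the idea without a Lucene dependency, here is a minimal sketch of which terms end up in the phrase for a multi-word input. The class name PhraseTermsSketch and its analyze method are hypothetical stand-ins: analyze only lowercases and splits on whitespace, whereas the real getTerms above runs the actual analyzer, so treat it as an approximation.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: not Lucene code. analyze() is a hypothetical stand-in for getTerms().
class PhraseTermsSketch {

    // Assumption: lowercase + whitespace split approximates the analyzer for this simple input.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        for (String token : text.toLowerCase().trim().split("\\s+")) {
            if (!token.isEmpty()) {
                terms.add(token);
            }
        }
        return terms;
    }

    // One entry per term that would be passed to PhraseQuery.add(new Term("title", ...)).
    static List<String> phraseTerms(String prefix, String userInput) {
        List<String> terms = new ArrayList<String>(analyze(prefix));
        terms.addAll(analyze(userInput));
        return terms;
    }

    public static void main(String[] args) {
        // prints [starts_with, la, mer]
        System.out.println(phraseTerms("STARTS_WITH", "la mer"));
    }
}
```

Each element of the resulting list becomes its own Term in the PhraseQuery, which is what the loop above does with the real analyzer output.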

femtoRgon · Accepted Answer · 2013-02-25T22:50:42

I believe the problem you are running into is that you are adding the entire phrase to your PhraseQuery as a single term. In the index, and in the query parsed by the QueryParser, this will be split into terms "starts_with" and "mer", which must be found consecutively. However, in the query you have constructed, you have a single term in your PhraseQuery instead, the term "starts_with mer", which doesn't exist as a single term in the index.

You should be able to change the bit where you are constructing the PhraseQuery to:

PhraseQuery titleStartsWithSearchString = new PhraseQuery();
titleStartsWithSearchString.add(new Term("title","STARTS_WITH".toLowerCase()));
titleStartsWithSearchString.add(new Term("title",searchString));
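
As a rough illustration of why the two queries print identically yet match differently, here is a toy model in plain Java (it does not use Lucene). The class PhraseTermDemo and its methods are hypothetical; render only imitates the printed form shown in the question, and analyze is a simplified stand-in for StandardAnalyzer.

```java
import java.util.Arrays;
import java.util.List;

// Toy model only: not Lucene code. It mimics term-level phrase matching.
class PhraseTermDemo {

    // Assumption: lowercase + whitespace split approximates the analyzer for these titles.
    static List<String> analyze(String text) {
        return Arrays.asList(text.toLowerCase().trim().split("\\s+"));
    }

    // Imitates the printed form from the question: the terms joined by spaces inside quotes.
    static String render(String field, List<String> phraseTerms) {
        return field + ":\"" + String.join(" ", phraseTerms) + "\"";
    }

    // True if phraseTerms occur as consecutive tokens in docTokens.
    static boolean matchesPhrase(List<String> docTokens, List<String> phraseTerms) {
        int n = phraseTerms.size();
        for (int start = 0; start + n <= docTokens.size(); start++) {
            if (docTokens.subList(start, start + n).equals(phraseTerms)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = analyze("starts_with mer");                   // indexed as two tokens
        List<String> oneTerm = Arrays.asList("starts_with mer");         // the question's phrase
        List<String> twoTerms = Arrays.asList("starts_with", "mer");     // the corrected phrase

        System.out.println(render("title", oneTerm));   // title:"starts_with mer"
        System.out.println(render("title", twoTerms));  // title:"starts_with mer"
        System.out.println(matchesPhrase(doc, oneTerm));   // false
        System.out.println(matchesPhrase(doc, twoTerms));  // true
    }
}
```

The printed form loses the term boundaries, so it cannot tell a one-term phrase containing a space apart from a two-term phrase. Only the second has terms that exist in the index.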
