Scrub Lucene search terms with the Standard Analyzer

Question

We are building a bool query out of search term strings to search our Lucene indexes. I would like these strings to be analyzed with the Standard Analyzer, the analyzer we are using for our indexes. For example, foo-bar 1-2-3 should be broken up as foo, bar, 1-2-3 since the Lucene doc states that hyphens cause numbers to stay together but words to be tokenized. What is the best way to do this?

Currently I am running my search term strings through a QueryParser.

QueryParser parser = new QueryParser("", new StandardAnalyzer()); 
Query query = parser.parse(aSearchTermString);

The problem is that the parser inserts quotes, turning the hyphenated word into a phrase query. For example, foo-bar 1-2-3 becomes "foo bar", 1-2-3, which does not return anything because Lucene tokenized foo-bar into the separate terms foo and bar at index time.

I definitely don't want to hack this situation by removing the quotes with replace because I feel that I am probably missing something or doing something incorrectly.

mindas · Accepted Answer · 2013-01-22T22:23:59

I am actually getting different results for StandardAnalyzer. Consider this code (using Lucene v4):

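import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.ClassicAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.util.Version;

public class Tokens {

    // Prints every token the analyzer emits, with its start and end offsets.
    private static void printTokens(String string, Analyzer analyzer) throws IOException {
        System.out.println("Using " + analyzer.getClass().getName());
        TokenStream ts = analyzer.tokenStream("default", new StringReader(string));
        OffsetAttribute offsetAttribute = ts.addAttribute(OffsetAttribute.class);
        CharTermAttribute charTermAttribute = ts.addAttribute(CharTermAttribute.class);

        ts.reset(); // the TokenStream contract requires reset() before incrementToken()
        while (ts.incrementToken()) {
            int startOffset = offsetAttribute.startOffset();
            int endOffset = offsetAttribute.endOffset();
            String term = charTermAttribute.toString();
            System.out.println(term + " (" + startOffset + " " + endOffset + ")");
        }
        ts.end();
        ts.close();
        System.out.println();
    }

    // parse() throws the checked ParseException, so main has to declare it.
    public static void main(String[] args) throws IOException, ParseException {
        printTokens("foo-bar 1-2-3", new StandardAnalyzer(Version.LUCENE_40));
        printTokens("foo-bar 1-2-3", new ClassicAnalyzer(Version.LUCENE_40));

        QueryParser standardQP = new QueryParser(Version.LUCENE_40, "", new StandardAnalyzer(Version.LUCENE_40));
        // Every special character, including the space, is backslash-escaped.
        BooleanQuery q1 = (BooleanQuery) standardQP.parse("someField:(foo\\-bar\\ 1\\-2\\-3)");
        System.out.println(q1.toString() + "     # of clauses:" + q1.getClauses().length);
    }
}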

The above prints:

Using org.apache.lucene.analysis.standard.StandardAnalyzer
foo (0 3)
bar (4 7)
1 (8 9)
2 (10 11)
3 (12 13)

Using org.apache.lucene.analysis.standard.ClassicAnalyzer
foo (0 3)
bar (4 7)
1-2-3 (8 13)

someField:foo someField:bar someField:1 someField:2 someField:3     # of clauses:5

So the code above shows that StandardAnalyzer in Lucene 4, unlike ClassicAnalyzer, splits 1-2-3 into separate tokens. If 1-2-3 has to stay together as a single token, ClassicAnalyzer is the one that behaves the way you describe. For queries, you need to escape every special character, including spaces; otherwise the QueryParser treats them as query syntax.
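The escaping used in the query string above can be sketched without any Lucene dependency. The class and method names here (QueryEscaper, escapeAll) are hypothetical, and the list of special characters is an assumption based on the classic query parser syntax in Lucene 4. Lucene ships a static QueryParser.escape(String) that covers the syntax characters, but as far as I know it leaves whitespace alone, which is why this sketch handles spaces itself.

```java
final class QueryEscaper {

    // Assumed set of characters the classic query parser treats as syntax.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    // Backslash-escapes every assumed syntax character and every whitespace character.
    static String escapeAll(String raw) {
        StringBuilder sb = new StringBuilder(raw.length() * 2);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (SPECIAL.indexOf(c) >= 0 || Character.isWhitespace(c)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: foo\-bar\ 1\-2\-3
        System.out.println(escapeAll("foo-bar 1-2-3"));
    }
}
```

The result can then be placed inside the field group, as in someField:( ... ), before it is handed to the parser.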

If you don't want to escape your query string, you can always tokenize it manually (like in printTokens method above), then wrap each token with a TermQuery and stack all TermQueries into a BooleanQuery.
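A minimal sketch of that manual approach, assuming Lucene 4.0 on the classpath. The class name AnalyzedQueryBuilder and its build method are hypothetical, and Occur.SHOULD is only one possible choice; use Occur.MUST if every token has to match.

```java
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.Version;

final class AnalyzedQueryBuilder {

    // Hypothetical helper: adds one optional TermQuery per token the analyzer emits.
    static BooleanQuery build(String field, String text, Analyzer analyzer) throws IOException {
        BooleanQuery query = new BooleanQuery();
        TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
        try {
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                query.add(new TermQuery(new Term(field, term.toString())),
                          BooleanClause.Occur.SHOULD);
            }
            ts.end();
        } finally {
            ts.close();
        }
        return query;
    }

    public static void main(String[] args) throws IOException {
        Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
        System.out.println(build("someField", "foo-bar 1-2-3", analyzer));
    }
}
```

Because the raw string never goes through the QueryParser, nothing needs escaping, and the query terms come from the same analyzer that produced the indexed terms.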
