lucene ngram tokenizer usage for fuzzy phrase match

Question

I am trying to achieve fuzzy phrase search (to match misspelled words) by using lucene, by referring various blogs I thought to try ngram indexes on fuzzy phrase search.

But I couldn't find ngram tokenizer as part of my lucene3.4 JAR library, is it deprecated and replaced with something else ? - currently I am using standardAnalyzer where I am getting decent results for exact match of terms.

I have below two requirements to handle.

My index is having document with phrase "xyz abc pqr", when I provide query "abc xyz"~5, I am able to get results, but my requirement is to get results for same document even though I have one extra word like "abc xyz pqr tst" in my query (I understand match score will be little less) - using proximity extra word in phrase is not working, if I remove proximity and double quotes " " from my query, I am getting expected results (but there I get many false positives like documents containing only xyz, only abc etc.)

In same above example, if somebody misspell query "abc xxz", I still want to get results for same document.

I want to give a try with ngram but not sure it will work as expected.

Any thoughts ?

John Li John Li · Accepted Answer · 2012-02-29T02:38:25

Try to use BooleanQuery and FuzzyQuery like:

    public void fuzzysearch(String querystr) throws Exception{
        querystr=querystr.toLowerCase();

        System.out.println("\n\n-------- Start fuzzysearch -------- ");

        // 3. search
        int hitsPerPage = 10;
        TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
        IndexReader reader = IndexReader.open(index);

        IndexSearcher searcher = new IndexSearcher(reader);
        BooleanQuery bq = new BooleanQuery();

        String[] searchWords = querystr.split(" ") ;
        int id=0;
        for(String word: searchWords ){
            Query query = new FuzzyQuery(new Term(NAME,word));
            if(id==0){
                bq.add(query, BooleanClause.Occur.MUST);
            }else{
                bq.add(query, BooleanClause.Occur.SHOULD);
            }
          id++;
        }
        System.out.println("query ==> " + bq.toString());
        searcher.search(bq, collector );
        parseResults(  searcher, collector  ) ;
        searcher.close();
    }

public void parseResults(IndexSearcher searcher, TopScoreDocCollector collector  ) throws  Exception {
ScoreDoc[] hits = collector.topDocs().scoreDocs;

    // 4. display results
    System.out.println("Found " + hits.length + " hits.");
    for(int i=0;i<hits.length;++i) {
      int docId = hits[i].doc;
      Document d = searcher.doc(docId);
      System.out.println((i + 1) + ". " + d.get(NAME));
    }

}

lucene ngram tokenizer usage for fuzzy phrase match

1 Answers