2
votes

I am working on indexing a large text file whose text contains no spaces. Currently I use my own n-gram method to generate strings of length 12 and then index them. Searching works the same way: I take the user's string, generate n-grams of length 12, and use them to build the query. While searching, I read about the n-gram tokenizer in Lucene, but I couldn't find many examples of it.

How do I use the n-gram tokenizer in Lucene 4.0?


1 Answer

7
votes

Probably the simplest way to use NGramTokenizer is with the constructor that just takes a reader and the min and max gram sizes. You can incorporate it into an analyzer, similar to the example in the Analyzer docs. Something like:

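import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.util.Version;

Analyzer analyzer = new Analyzer() {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    Tokenizer source = new NGramTokenizer(reader, 12, 12); // min and max gram size both 12
    TokenStream filter = new LowerCaseFilter(Version.LUCENE_40, source);
    return new TokenStreamComponents(source, filter);
  }
};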
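
If it helps to see what kind of tokens to expect, here is a minimal plain-Java sketch (no Lucene involved) of a fixed-length sliding window followed by lowercasing, which is the general idea behind using min and max gram sizes of 12 with a lowercase filter. The class name `NGramSketch` and the method `ngrams` are made up for illustration, and this is not Lucene's implementation, so details such as offsets or input-size handling may differ from what the real tokenizer does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Lucene.
public class NGramSketch {

    // Returns every substring of length n, sliding one character at a time,
    // lowercased to mimic a lowercase filter applied after tokenizing.
    static List<String> ngrams(String text, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        String lower = text.toLowerCase(Locale.ROOT);
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= lower.length(); i++) {
            grams.add(lower.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // A 14-character input yields three 12-character grams.
        for (String gram : ngrams("ABCDEFGHIJKLMN", 12)) {
            System.out.println(gram);
        }
    }
}
```

Input shorter than 12 characters produces no grams at all in this sketch, so it is worth deciding how short queries should be handled at search time.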