2
votes

I am new to Lucene.Net. Which is the best Analyzer to use in Lucene.Net? Also, I want to know how to use the stop words and word stemming features.

3 Answers

1
votes

I'm also new to Lucene.Net, but I do know that the SimpleAnalyzer does not remove any stop words; it indexes all tokens/words.

Here's a link to some Lucene info: http://darksleep.com/lucene/. By the way, the .NET version is an almost line-for-line port of the Java version, so the Java documentation should work fine in most cases. There's a section in there about the three analyzers: Simple, Stop, and Standard.
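To see how those three analyzers differ in practice, one option is to print the tokens each one produces for the same text. This is only a rough sketch, assuming Lucene.Net 3.0.3 (older releases expose token attributes a little differently); the `Dump` helper, the field name `body`, and the sample sentence are made up for illustration.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.Tokenattributes;
using Version = Lucene.Net.Util.Version;

static class AnalyzerComparison
{
    // Hypothetical helper: prints every token the analyzer emits for the text.
    static void Dump(string label, Analyzer analyzer, string text)
    {
        TokenStream stream = analyzer.TokenStream("body", new StringReader(text));
        ITermAttribute term = stream.AddAttribute<ITermAttribute>();
        stream.Reset();
        Console.Write(label + ":");
        while (stream.IncrementToken())
        {
            Console.Write(" [" + term.Term + "]");
        }
        Console.WriteLine();
    }

    static void Main()
    {
        // Made-up sample text with stop words, mixed case, and an e-mail address.
        const string text = "The Quick Brown Fox's e-mail is fox@example.com";
        Dump("Simple", new SimpleAnalyzer(), text);
        Dump("Stop", new StopAnalyzer(Version.LUCENE_30), text);
        Dump("Standard", new StandardAnalyzer(Version.LUCENE_30), text);
    }
}
```

Running the same text through each analyzer makes it easy to check which words get dropped and how things like apostrophes and e-mail addresses are split before you commit to one for your index.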

I'm not sure how Lucene.Net handles word stemming, but this article, http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html?page=2, demonstrates how to create your own Analyzer in Java and uses a PorterStemFilter to do the word stemming.

...[T]he Porter stemming algorithm (or "Porter stemmer") is a process for removing the more common morphological and inflexional endings from words in English.
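For what it's worth, the same idea carries over to .NET fairly directly. Below is a minimal sketch of a custom analyzer with a PorterStemFilter at the end of the chain, assuming Lucene.Net 3.0.3; the class name `StemmingAnalyzer` is hypothetical, and the filter order is just one reasonable choice, not the linked article's exact code.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Version = Lucene.Net.Util.Version;

// Hypothetical analyzer: the usual StandardAnalyzer-style chain plus a Porter stemmer.
public class StemmingAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        TokenStream result = new StandardTokenizer(Version.LUCENE_30, reader);
        result = new StandardFilter(result);
        // Lowercase before stemming; the Porter stemmer expects lowercase input.
        result = new LowerCaseFilter(result);
        // Drop the default English stop words (true = keep position increments).
        result = new StopFilter(true, result, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
        // Reduce each remaining token to its Porter stem.
        result = new PorterStemFilter(result);
        return result;
    }
}
```

Whatever analyzer you end up with, pass the same one to both the IndexWriter and the QueryParser, so that query terms are stemmed the same way as the indexed terms.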

I hope that is helpful.

0
votes

The best analyzer I have found is the StandardAnalyzer, which also lets you specify your own stop words. For example:

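        using System.IO;
        using Lucene.Net.Analysis;
        using Lucene.Net.Analysis.Standard;
        using Lucene.Net.Store;
        string indexFileLocation = @"C:\Index";
        string stopWordsLocation = @"C:\Stopwords.txt";   // plain text, one stop word per line
        var directory = FSDirectory.Open(new DirectoryInfo(indexFileLocation));
        // The words read from the file replace the analyzer's default English stop list.
        Analyzer analyzer = new StandardAnalyzer(
            Lucene.Net.Util.Version.LUCENE_29, new FileInfo(stopWordsLocation));

0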
votes

It depends on your requirements. If your requirements are ultra simple, e.g. case-insensitive, non-stemming searches, then StandardAnalyzer is a good choice. If you look into the Analyzer class and get familiar with filters, particularly TokenFilter, you can exert an enormous amount of control over your index by rolling your own analyzer.

Stemmers are tricky, and it's important to have a deep understanding of what type of stemming you really need. I've used the Snowball stemmers. For example, the words "policy" and "police" have the same root in the English Snowball stemmer, and getting hits on documents with "policy" when the search term is "police" isn't so hot. I've implemented strategies to support both stemmed and non-stemmed search so that this can be avoided, but it's important to understand the impact.
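One common way to support both kinds of search (not necessarily the strategy described above) is to index the same text into two fields, one stemmed and one not. A rough sketch, assuming Lucene.Net 3.0.3 plus the contrib Snowball assembly; the field names `body` and `body_stemmed` and the sample text are made up for illustration.

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Snowball;   // from the Lucene.Net contrib Snowball assembly
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using Version = Lucene.Net.Util.Version;

static class DualFieldIndexing
{
    static void Main()
    {
        // Default analyzer leaves terms unstemmed; only "body_stemmed" gets the stemmer.
        var analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_30));
        analyzer.AddAnalyzer("body_stemmed", new SnowballAnalyzer(Version.LUCENE_30, "English"));

        var directory = new RAMDirectory();   // in-memory index, just for the sketch
        using (var writer = new IndexWriter(directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            const string text = "The police reviewed the policy.";
            var doc = new Document();
            // Same text indexed twice: once as written, once stemmed.
            doc.Add(new Field("body", text, Field.Store.YES, Field.Index.ANALYZED));
            doc.Add(new Field("body_stemmed", text, Field.Store.NO, Field.Index.ANALYZED));
            writer.AddDocument(doc);
        }
    }
}
```

With this layout you can query `body` when you want exact word matches and `body_stemmed` when recall matters more, as long as queries are parsed with the same per-field wrapper.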

Beware of temptations like stop words. If you need to search for the phrase "to be or not to be" and the standard stop words are enabled, your search will fail to find documents with that phrase, because every word in it is on the default English stop list and never makes it into the index.
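If phrases like that matter to you, one option is to give StandardAnalyzer an empty stop word set so nothing is dropped. A minimal sketch, assuming Lucene.Net 3.0.3 built for .NET 4 or later (where the stop word parameter is a `System.Collections.Generic.ISet<string>`); the `NoStopWords` class is hypothetical.

```csharp
using System.Collections.Generic;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Version = Lucene.Net.Util.Version;

static class NoStopWords
{
    // Builds a StandardAnalyzer that keeps every word, including "to", "be", "or", "not".
    public static Analyzer Create()
    {
        // An empty set means there are no stop words to filter out.
        return new StandardAnalyzer(Version.LUCENE_30, new HashSet<string>());
    }
}
```

The trade-off is a larger index and more matches on very common words, so it is worth testing against your own data before deciding.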