Is there a non-obsolete Lucene.NET Analyzer that can do English stemming or lemmatization, or do I need to write a custom Analyzer?

I can't find an Analyzer in the source code that includes PorterStemFilter or EnglishMinimalStemFilter. I could write my own Analyzer, but that shouldn't be required, and I'd be reinventing the wheel.

I'm trying to stem English words in Lucene.NET. As far as I can tell, this does not work out of the box. I tried using the EnglishAnalyzer like so:

[TestFixture]
public class TestAnalyzers
{
    private const string FieldName = "CustomFieldName"; 

    public Directory CreateDirectory(IEnumerable<string> documents, Analyzer analyzer)
    {
        var directory = new RAMDirectory();
        var iwc = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
        {
            OpenMode = OpenMode.CREATE_OR_APPEND,
        };
        var writer = new IndexWriter(directory, iwc);
        writer.Commit();
        foreach(var doc in documents) {
            var document = new Document();
            document.AddTextField(FieldName, doc, StoredField.Store.YES);
            writer.AddDocument(document);
        }

        writer.Flush(true, true);
        writer.Commit();
        return directory;
    }

    private QueryParser CreateQueryParser(Analyzer analyzer) 
        => new MultiFieldQueryParser(
        LuceneVersion.LUCENE_48,
        GetSearchFields(),
        analyzer);

    private string[] GetSearchFields() => new [] { FieldName };



    [TestCase("for", "for")]
    [TestCase("for", "forward")]
    [TestCase("forward", "for")]
    //[TestCase("retire", "retirement")]
    [TestCase("retirement", "retire")]
    [Test]
    public void TestPartialWordsStandard(string fieldValue, string query)
    {
        var analyzer = new EnglishAnalyzer(LuceneVersion.LUCENE_48);
        var directory = CreateDirectory(new [] { fieldValue }, analyzer);

        var indexReader = DirectoryReader.Open(directory);
        Assert.AreEqual(1, indexReader.NumDocs);
        var doc = indexReader.Document(0);
        Assert.NotNull(doc);
        Assert.AreEqual(fieldValue, doc.GetField(FieldName).GetStringValue());


        var searcher = new IndexSearcher(indexReader);

        var queryObj = CreateQueryParser(analyzer).Parse(query);

        var results = searcher.Search(queryObj, 2);

        Assert.AreEqual(1, results.TotalHits);
        doc = indexReader.Document(results.ScoreDocs.First().Doc);
        Assert.AreEqual(fieldValue, doc.GetField(FieldName).GetStringValue());

    }
}

It did no stemming. From reading the code, it uses a possessive filter to remove 's and s, but not the English stemming filter or the PorterStemFilter.

I was able to get some stemming to happen with var analyzer = new SnowballAnalyzer(LuceneVersion.LUCENE_48, "English");. It's an adequate amount of stemming, but the class is obsolete.

1 Answer


The Lucene.Net EnglishAnalyzer does include Porter stemming. Line 117 of the source code for the class reads:

result = new PorterStemFilter(result);

I also ran a test on my system using the EnglishAnalyzer and confirmed that it is in fact stemming. For example, my indexed text contained the word "walking", and when I searched for "walked" I got a hit on the record.
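
A quick way to check this yourself is to print the terms the analyzer actually emits, without building an index. The following is only a minimal sketch, assuming Lucene.NET 4.8 with the Lucene.Net.Analysis.Common package referenced. The class name StemCheck, the field name "body" and the sample text are made up for illustration.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.En;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class StemCheck
{
    public static void Main()
    {
        // "body" is a hypothetical field name; EnglishAnalyzer does not vary by field.
        using (var analyzer = new EnglishAnalyzer(LuceneVersion.LUCENE_48))
        using (TokenStream stream = analyzer.GetTokenStream("body", "walking walked"))
        {
            // The term attribute exposes the text of the current token.
            var term = stream.AddAttribute<ICharTermAttribute>();

            stream.Reset(); // required before the first IncrementToken call
            while (stream.IncrementToken())
            {
                // Print each term after the analyzer's whole filter chain has run.
                Console.WriteLine(term.ToString());
            }
            stream.End();
        }
    }
}
```

If stemming is active, both input words should come out as the same term, which is why a search for "walked" can match a document containing "walking". Running the same loop over the field values and queries from a failing test shows which terms are actually being compared.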