Is there a non-obsolete Lucene.NET Analyzer that can do English-language stemming or lemmatization, or do I need to write a custom Analyzer?
I can't seem to find an Analyzer that includes PorterStemFilter or EnglishMinimalStemFilter in the source code. I could write my own Analyzer, but it feels like that shouldn't be required, and I'd be reinventing the wheel.
I'm trying to do stemming of English words in Lucene.NET. As far as I can tell, this does not work out of the box. I tried using the EnglishAnalyzer like so:
    using System.Collections.Generic;
    using System.Linq;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.En;
    using Lucene.Net.Documents;
    using Lucene.Net.Documents.Extensions; // AddTextField extension method
    using Lucene.Net.Index;
    using Lucene.Net.QueryParsers.Classic;
    using Lucene.Net.Search;
    using Lucene.Net.Store;
    using Lucene.Net.Util;
    using NUnit.Framework;

    [TestFixture]
    public class TestAnalyzers
    {
        private const string FieldName = "CustomFieldName";

        // Builds an in-memory index containing one document per input string.
        public Directory CreateDirectory(IEnumerable<string> documents, Analyzer analyzer)
        {
            var directory = new RAMDirectory();
            var iwc = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
            {
                OpenMode = OpenMode.CREATE_OR_APPEND,
            };

            // Dispose the writer once the documents are committed so the write lock is released.
            using (var writer = new IndexWriter(directory, iwc))
            {
                foreach (var doc in documents)
                {
                    var document = new Document();
                    document.AddTextField(FieldName, doc, Field.Store.YES);
                    writer.AddDocument(document);
                }
                writer.Commit();
            }

            return directory;
        }

        // The same analyzer is used for indexing and for query parsing.
        private QueryParser CreateQueryParser(Analyzer analyzer)
            => new MultiFieldQueryParser(
                LuceneVersion.LUCENE_48,
                GetSearchFields(),
                analyzer);

        private string[] GetSearchFields() => new[] { FieldName };

        [TestCase("for", "for")]
        [TestCase("for", "forward")]
        [TestCase("forward", "for")]
        //[TestCase("retire", "retirement")]
        [TestCase("retirement", "retire")]
        [Test]
        public void TestPartialWordsStandard(string fieldValue, string query)
        {
            var analyzer = new EnglishAnalyzer(LuceneVersion.LUCENE_48);
            var directory = CreateDirectory(new[] { fieldValue }, analyzer);

            using (var indexReader = DirectoryReader.Open(directory))
            {
                // Sanity check: the single document was indexed and its value stored.
                Assert.AreEqual(1, indexReader.NumDocs);
                var doc = indexReader.Document(0);
                Assert.NotNull(doc);
                Assert.AreEqual(fieldValue, doc.GetField(FieldName).GetStringValue());

                // The query should match the document if stemming is applied.
                var searcher = new IndexSearcher(indexReader);
                var queryObj = CreateQueryParser(analyzer).Parse(query);
                var results = searcher.Search(queryObj, 2);
                Assert.AreEqual(1, results.TotalHits);

                doc = indexReader.Document(results.ScoreDocs.First().Doc);
                Assert.AreEqual(fieldValue, doc.GetField(FieldName).GetStringValue());
            }
        }
    }
It did no stemming. From reading the code, it uses a possessive filter to remove 's and s, but not the English stemming filter or the PorterStemFilter.
I was able to get some stemming to happen with:

    var analyzer = new SnowballAnalyzer(LuceneVersion.LUCENE_48, "English");

It's an adequate amount of stemming, but the class is obsolete.
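To be clear about what I'd like to avoid: as far as I understand it, a custom Analyzer would look something like the sketch below. The class name PorterStemAnalyzer is a placeholder of mine, and the chain of StandardTokenizer, LowerCaseFilter and PorterStemFilter is only my assumption about a reasonable minimum (no stop words, no possessive handling), not something taken from the Lucene.NET source.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.En;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

// Hypothetical custom analyzer: tokenize, lowercase, then Porter-stem each token.
public sealed class PorterStemAnalyzer : Analyzer
{
    private readonly LuceneVersion _matchVersion;

    public PorterStemAnalyzer(LuceneVersion matchVersion)
    {
        _matchVersion = matchVersion;
    }

    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        // Split the input into word tokens.
        Tokenizer source = new StandardTokenizer(_matchVersion, reader);

        // PorterStemFilter expects lowercase input, so lowercase first.
        TokenStream result = new LowerCaseFilter(_matchVersion, source);
        result = new PorterStemFilter(result);

        return new TokenStreamComponents(source, result);
    }
}
```

It would be used in place of the EnglishAnalyzer in the test above, i.e. var analyzer = new PorterStemAnalyzer(LuceneVersion.LUCENE_48); for both indexing and query parsing.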