Lucene.Net not returning expected search results when "-" or wildcards are used

Question

I'm using Lucene.Net 2.9 and trying to understand why my queries aren't returning the expected results.

I use the following function to add fields to the indexed documents.

//add fields to the document
public void AddFacet(Lucene.Net.Documents.Document doc, String facetName, String facetValue)
{
    doc.Add(new Lucene.Net.Documents.Field(facetName, facetValue, Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.NOT_ANALYZED));
}

//snippet of analyzer being used
Lucene.Net.Analysis.Analyzer analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);

//snippet of a simple demo
Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
AddFacet(doc, "FACET", "INDEX-VALUE-TEST");

From what I understand, since I'm using Lucene.Net.Documents.Field.Index.NOT_ANALYZED when adding the fields to the document, the facetValue won't be tokenized into terms.

I believe this means that the original facetValue is stored as the single term "INDEX-VALUE-TEST". If it were tokenized, it would be stored as the separate terms "index", "value" and "test", since the StandardAnalyzer lowercases the text and treats - as a token separator.
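
As a rough, library-free illustration of that distinction, the sketch below contrasts keeping the value as one term with splitting it into lowercased terms. `ApproximateAnalyze` is a hypothetical stand-in written for this example (split on non-alphanumeric characters, then lowercase); it is not the real StandardAnalyzer, which applies more rules.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class AnalysisSketch
{
    // Hypothetical stand-in for an analyzer: split on anything that is not
    // a letter or digit, then lowercase each piece.
    public static List<string> ApproximateAnalyze(string value)
    {
        var terms = new List<string>();
        foreach (string part in Regex.Split(value, "[^A-Za-z0-9]+"))
        {
            if (part.Length > 0)
            {
                terms.Add(part.ToLowerInvariant());
            }
        }
        return terms;
    }

    public static void Main()
    {
        string facetValue = "INDEX-VALUE-TEST";

        // NOT_ANALYZED: the whole value is kept as one term, case preserved.
        Console.WriteLine("not analyzed: " + facetValue);

        // ANALYZED (approximation only): several lowercased terms.
        string[] terms = ApproximateAnalyze(facetValue).ToArray();
        Console.WriteLine("analyzed:     " + string.Join(", ", terms));
    }
}
```

Under this approximation, a lowercased query term such as index-value-test is not character-for-character equal to the single stored term INDEX-VALUE-TEST.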

If I perform a search for "INDEX", my query will look like +(xml:index), which returns all documents that contain "INDEX" in any of their terms. This is expected.

I don't understand the following cases:

1. If I perform a search for "INDEX-VAL", my query will look like +(xml:index-val), which returns no results. I can see why this returns no results, since there is no wildcard.
2. If I perform a search for "INDE*", my query will look like +(xml:inde*), which again returns no results. I'm not sure why this doesn't return any documents. I would expect to get back all the documents that contain "INDE" in any of their fields.
3. If I search for "INDEX-VALUE-TEST", my query will look like +(xml:index-value-test). Again, no results. I would expect to get back 1 document.

If I stored the term as "INDEX-VALUE-TEST", then why don't cases #2 and #3 return results? I can see why #1 wouldn't, since it might need a wildcard to match the rest of the term. If that's the case, why can I search for "INDEX" with no wildcard and get all the documents?

I've been using this source to understand the indexing files.

I've been using this source to understand the fields I'm adding to the document.

If anyone could help me understand what I'm missing, it would be greatly appreciated.

Kevin · Accepted Answer · 2014-11-14T22:52:08

I think the right way to solve this would be to write our own parser/analyzer so that we have more control over what is happening. The level of effort can't be justified for now (perhaps it will be once other issues pop up).

My workaround was to replace every - with whitespace before running the search. That made the search results more consistent with what I was expecting. This should be fine, since the analyzer would normally split on this character consistently for non-wildcard queries.
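
A minimal sketch of that workaround, assuming the raw search text is available as a string before it reaches the query parser. `QuerySanitizer` and `ReplaceHyphens` are hypothetical names chosen for this example, not part of Lucene.Net.

```csharp
using System;

public static class QuerySanitizer
{
    // Replaces every '-' with a space so the analyzer sees separate words.
    // A null input is treated as an empty search string.
    public static string ReplaceHyphens(string userInput)
    {
        if (userInput == null)
        {
            return string.Empty;
        }
        return userInput.Replace('-', ' ');
    }

    public static void Main()
    {
        // The sanitized text would then be handed to the query parser.
        Console.WriteLine(ReplaceHyphens("INDEX-VALUE-TEST"));
    }
}
```

One trade-off to keep in mind: in the query parser syntax a leading - marks a prohibited term, so a blanket replacement like this also removes that operator from user input.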
