2
votes

I'm using Lucene.Net 2.9 and trying to understand why my queries aren't returning the expected results.

I use the following function to add fields to the indexed documents.

//add fields to the document
public void AddFacet(Lucene.Net.Documents.Document doc, String facetName, String facetValue)
{
    doc.Add(new Lucene.Net.Documents.Field(facetName, facetValue, Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.NOT_ANALYZED));
}

//snippet of analyzer being used
Lucene.Net.Analysis.Analyzer analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);

//snippet of a simple demo
Lucene.Net.Documents.Document doc = new Lucene.Net.Documents.Document();
AddFacet(doc, "FACET", "INDEX-VALUE-TEST");

From what I understand, since I'm using Lucene.Net.Documents.Field.Index.NOT_ANALYZED when adding the fields to the document, the facetValue won't be tokenized into terms.

I believe this means that the original facetValue is stored as "INDEX-VALUE-TEST". If it were to be tokenized, it would be stored with multiple terms of "INDEX", "VALUE" and "TEST", since the analyzer interprets - as a stop word.

If I perform a search for "INDEX", my query will look like +(xml:index), which returns all documents that contain "INDEX" in any of their terms. This is expected.

I don't understand the following cases:

  1. If I perform a search for "INDEX-VAL", my query will look like +(xml:index-val), which returns no results. I can see why this returns no results, since there is no wildcard.

  2. If I perform a search for "INDE*", my query will look like +(xml:inde*), which again returns no results. I'm not sure why this doesn't return any documents. I would expect to get back all the documents that contain "INDE" in any of their fields.

  3. If I search for "INDEX-VALUE-TEST", my query will look like +(xml:index-value-test). Again, no results. I would expect to get back 1 document.

If I stored the term as "INDEX-VALUE-TEST", then why don't cases #2 and #3 return results? I can see why #1 wouldn't, since it might need a wildcard to match the rest of the term. If that's the case, why can I search for "INDEX" with no wildcard and get all the documents?

I've been using this source to understand the indexing files.

I've been using this source to understand the fields I'm adding to the document.

If anyone could help me understand what I'm missing, it would be greatly appreciated.

3 Answers

2
votes

I think the right way to solve this would be to write our own parser/analyzer so that we had more control over what was happening. The level of effort couldn't be justified for now (perhaps until other issues pop up).

My workaround was to replace every - with a space when doing the search. It made the search results more consistent with what I was expecting. This should be fine, since the analyzer would normally tokenize this character in a consistent manner for non-wildcard queries.
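
The workaround amounts to a small preprocessing step on the query text before it reaches the query parser. A minimal sketch, assuming a hypothetical helper named NormalizeHyphens (it is not part of Lucene.Net):

```csharp
using System;

public static class QueryTextHelper
{
    // Hypothetical helper: replaces every hyphen with a space so that the
    // query text is split at the same places the analyzer splits indexed text.
    public static string NormalizeHyphens(string queryText)
    {
        if (queryText == null) throw new ArgumentNullException("queryText");
        return queryText.Replace('-', ' ');
    }
}
```

With this, "INDEX-VALUE-TEST" is handed to the parser as "INDEX VALUE TEST", which it treats as three separate terms. Note that this only helps for fields that were analyzed at index time.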

0
votes

From what I understand, if a field is indexed using Lucene.Net.Documents.Field.Index.NOT_ANALYZED, then the search will be case sensitive. If you change your search string to uppercase, you may get results for cases #2 and #3. If the analyzer you use when searching converts everything to lowercase, then you may need to index the field as a lowercase string instead.

For case #3 you may also need to escape the '-' characters in the search query, so the search becomes +(xml:INDEX\-VALUE\-TEST), because a '-' in the query could be interpreted as a boolean operator.
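
Since a NOT_ANALYZED value is indexed as a single case-sensitive term, one way to avoid both the lowercasing and the escaping is to build the query in code rather than through the query parser. A minimal sketch, assuming the field name FACET from the question; the class and method names are hypothetical:

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;

public static class FacetQueries
{
    // Matches documents whose field holds exactly this value.
    // The text is not analyzed, so case and hyphens are kept as given.
    public static Query Exact(string field, string value)
    {
        return new TermQuery(new Term(field, value));
    }

    // Matches documents whose field value starts with the given prefix,
    // again without analysis or lowercasing.
    public static Query StartsWith(string field, string prefix)
    {
        return new PrefixQuery(new Term(field, prefix));
    }
}
```

For example, FacetQueries.Exact("FACET", "INDEX-VALUE-TEST") targets the single indexed term, and FacetQueries.StartsWith("FACET", "INDE") covers the intent of case #2.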

0
votes

You wrote:

the analyzer interprets - as a stop word.

StandardAnalyzer tokenizes text using StandardTokenizer, which interprets the hyphen (-) as punctuation, not as a stop word. In practical terms the result is the same: it drops the hyphen.

StandardTokenizer will tokenize the query expression "INDEX-VALUE-TEST" into three tokens:

( INDEX, VALUE, TEST )

which cannot match on the single token in your index:

( INDEX-VALUE-TEST )

This treatment of the hyphen is expected to change in the future, when Lucene applies the Unicode segmentation rules in UAX 29. But the problem here is not punctuation because the hyphen in "INDEX-VALUE-TEST" is not a punctuation character.

Anyway it looks like you are searching on field "xml" instead of "FACET", because the query parser yields this query:

+(xml:index-value-test)

I'd say that "xml" is the default search field for your index.

You may get the behaviour that you want with WhitespaceAnalyzer, but I would suggest using KeywordAnalyzer, which seems closer to your intention (treat the whole field as a single token).

Note that you must use the same analyzer for indexing and searching. And when using the query parser, specify the relevant field:

FACET:INDEX-VALUE-TEST

If you need StandardAnalyzer for other fields in addition to KeywordAnalyzer for FACET, you could use a PerFieldAnalyzerWrapper.
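
A minimal sketch of that setup, assuming Lucene.Net 2.9 and the field name FACET from the question. The class and method names are hypothetical, and the call that disables lowercasing of wildcard terms is an addition not mentioned above:

```csharp
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;

public static class FacetSearchSetup
{
    // Builds one analyzer to share between indexing and searching.
    public static Analyzer CreateAnalyzer()
    {
        // StandardAnalyzer remains the default for every other field.
        PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(
            new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29));
        // KeywordAnalyzer emits the whole FACET value as a single token.
        wrapper.AddAnalyzer("FACET", new KeywordAnalyzer());
        return wrapper;
    }

    // Parses query text with FACET as the default field, so that
    // "INDEX-VALUE-TEST" is not sent to the "xml" field.
    public static Query ParseFacetQuery(string queryText)
    {
        QueryParser parser = new QueryParser(
            Lucene.Net.Util.Version.LUCENE_29, "FACET", CreateAnalyzer());
        // Assumption: wildcard and prefix terms are lowercased by default,
        // which would miss an uppercase indexed value such as INDEX-VALUE-TEST.
        parser.SetLowercaseExpandedTerms(false);
        return parser.Parse(queryText);
    }
}
```

The analyzer returned by CreateAnalyzer would also be the one passed to the IndexWriter, so that indexing and searching agree on how each field is tokenized.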