Lucene does not index some terms in documents

Question

I have been trying to use Lucene to index our code database. Unfortunately, some terms are omitted from the index. For example, in the string below, I can search for any term except "version-number":

version-number "cAELimpts.spl SCOPE-PAY:10.1.10 25nov2013kw101730 Setup EMployee field if missing"

I have tried implementing it with both Lucene.NET 3.1 and pylucene 6.2.0, with the same result.

Here are some details of my implementation in Lucene.NET:

using (var writer = new IndexWriter(FSDirectory.Open(INDEX_DIR), new CustomAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED))
{
  Console.Out.WriteLine("Indexing to directory '" + INDEX_DIR + "'...");
  IndexDirectory(writer, docDir);
  Console.Out.WriteLine("Optimizing...");
  writer.Optimize();
  writer.Commit();
}

The CustomAnalyzer class:

public sealed class CustomAnalyzer : Analyzer
{
    public override TokenStream TokenStream(System.String fieldName, System.IO.TextReader reader)
    {
        return new LowerCaseFilter(new CustomTokenizer(reader));
    }
}

Finally, the CustomTokenizer class:

public class CustomTokenizer : CharTokenizer
{
    public CustomTokenizer(TextReader input) : base(input)
    {
    }

    public CustomTokenizer(AttributeFactory factory, TextReader input) : base(factory, input)
    {
    }

    public CustomTokenizer(AttributeSource source, TextReader input) : base(source, input)
    {
    }

    protected override bool IsTokenChar(char c)
    {
        return System.Char.IsLetterOrDigit(c) || c == '_' || c == '-';
    }
}

It looks like "version-number" and some other terms are not getting indexed because they are present in 99% of the documents. Can it be the cause of the problem?

EDIT: As requested, the FileDocument class:

public static class FileDocument
{
    public static Document Document(FileInfo f)
    {

        // make a new, empty document
        Document doc = new Document();

        doc.Add(new Field("path", f.FullName, Field.Store.YES, Field.Index.NOT_ANALYZED));
        // Use the full timestamp; LastWriteTime.Millisecond is only the 0-999 millisecond component.
        doc.Add(new Field("modified", DateTools.DateToString(f.LastWriteTime, DateTools.Resolution.MINUTE), Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.Add(new Field("contents", new StreamReader(f.FullName, System.Text.Encoding.Default)));

        // return the document
        return doc;
    }
}

So you wrote a custom analyzer and it is not working as desired? What value of version-number did you try to index: the one long value shown in the question? You have not shown your Document structure; please provide that part. — Sabir Khan
I have added the FileDocument class to my question. I have tried StandardAnalyzer before creating the custom one. It is very simple and I would expect it to index all documents that contain the term "version-string" as part of the "contents" field. — n.jmurov
This is also interesting. When I search for "bill-of-materials", the Lucene search does not produce any results (grepping finds a few hundred matches). However, when I search for "delete bill-of-materials", both Lucene and grep find the same number of files (about 10). What's going on here? How can I make the Lucene and grep search results match? — n.jmurov

n.jmurov · Accepted Answer · 2017-06-15T10:27:47

I think I was being an idiot. I was limiting the number of hits to 500 and then applying filters to the hits that were found. The items were expected to be retrieved in the order they had been indexed. So when I was looking for something near the end of the index, it would tell me that nothing was found. In fact, it did retrieve the expected 500 items, but all of them were then filtered out.
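
For anyone hitting the same symptom, here is a minimal sketch of the pitfall in plain Python. It makes no Lucene calls; the toy index, the `billing/` path filter, and all function names are hypothetical and only stand in for the real search and filter code, which is not shown in the question.

```python
import itertools

# Toy "index": 2000 documents that all contain the term, stored in indexing order.
# Only the last 10 documents live under the (hypothetical) billing/ directory.
index = [
    {"path": ("billing/" if i >= 1990 else "core/") + "file%d.p" % i,
     "contents": {"version-number", "setup"}}
    for i in range(2000)
]

def search(index, term):
    """Return ids of all documents containing the term, in indexing order."""
    return [doc_id for doc_id, doc in enumerate(index) if term in doc["contents"]]

def limit_then_filter(index, term, keep, limit=500):
    """The buggy order: cap the hit list first, then filter what is left."""
    hits = search(index, term)[:limit]
    return [doc_id for doc_id in hits if keep(index[doc_id])]

def filter_then_limit(index, term, keep, limit=500):
    """The safer order: filter every hit, then cap the filtered list."""
    kept = (doc_id for doc_id in search(index, term) if keep(index[doc_id]))
    return list(itertools.islice(kept, limit))

def in_billing(doc):
    """Hypothetical post-search filter: keep only files under billing/."""
    return doc["path"].startswith("billing/")

missed = limit_then_filter(index, "version-number", in_billing)
found = filter_then_limit(index, "version-number", in_billing)

print(len(missed), "hits when limiting first")
print(len(found), "hits when filtering first")
```

With limiting first, the 500 retained hits are all `core/` files, so the filter discards every one of them and the term looks as if it was never indexed. The term is in the index all along; either apply the filter as part of the search or raise the hit limit before filtering.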
