
I have an array of searchable terms, and I want to use Lucene to basically CTRL-F through this stack of documents and find and store the locations of all of those terms within that stack of documents. For example:

Terms: "A", "B", "C"

Doc1: "CREATION" Doc2: "A BIG CAR" Doc3: "DOUBLE TROUBLE"

If I query the letter "A", I want to be able to say that there are 3 "A"s:

  • Doc1 at position 4
  • Doc2 at position 1
  • Doc2 at position 8

Something like that. How can I do this? So far, I'm just using a StandardAnalyzer like so:

public Analyzer _analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

// for some directory defined here

using (var indexWriter = new IndexWriter(directory, _analyzer, true, new IndexWriter.MaxFieldLength(IndexWriter.DEFAULT_MAX_FIELD_LENGTH)))
{
    using (var textReader = new StreamReader(blobStream))
    {
        // this code should analyze and write my indexes to the lucene instance

        var text = await textReader.ReadToEndAsync();
        var document = new Document();
        document.Add(new Field("Text", text, Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.WITH_POSITIONS_OFFSETS));
        document.Add(new Field("DocId", docId.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        document.Add(new Field("FamilyId", familyId.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        indexWriter.AddDocument(document);
    }
}

Lucene initially generates a lot of index files, but then merges everything into a single .cfs file and deletes the rest. How do I keep the other files so I can run my queries?


1 Answer


To index on arbitrary character positions, you can use the NGramTokenizer. While creating the index, you should also use FieldType.setStoreTermVectors(true); and FieldType.setStoreTermVectorPositions(true); so that the positions of the terms are actually stored (that is the Lucene 4.x Java API; in Lucene.Net 3.0 the equivalent is the Field.TermVector.WITH_POSITIONS_OFFSETS flag you are already passing). Have a look at this question, which already contains the correct code for retrieving the term positions.
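For completeness, here is a minimal sketch of how the two pieces could look in Lucene.Net 3.0 (the version the question targets): a custom analyzer built on NGramTokenizer so that single characters become searchable terms, and TermPositionVector to read the stored character offsets back out. The class and method names below are from my reading of the Lucene.Net 3.0.3 API (NGramTokenizer ships in the contrib analyzers package), so treat this as a sketch rather than tested code:

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.NGram; // from the Lucene.Net.Contrib.Analyzers package
using Lucene.Net.Index;

// Analyzer that splits the text into single-character tokens,
// so a query for "A" can match the "A" inside "CREATION".
public class SingleCharAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new NGramTokenizer(reader, 1, 1); // min gram = max gram = 1 char
    }
}

public static class OffsetDumper
{
    // After indexing with Field.TermVector.WITH_POSITIONS_OFFSETS (as the
    // question already does), the character offsets can be read back per doc:
    public static void PrintOffsets(IndexReader reader, int docNumber, string term)
    {
        var vector = (TermPositionVector)reader.GetTermFreqVector(docNumber, "Text");
        int termIndex = vector.IndexOf(term);
        if (termIndex < 0) return; // term not present in this document

        foreach (var offset in vector.GetOffsets(termIndex))
        {
            // StartOffset is the character position within the original text
            System.Console.WriteLine(
                "Doc {0}: chars {1}-{2}", docNumber, offset.StartOffset, offset.EndOffset);
        }
    }
}
```

For the example in the question, indexing "A BIG CAR" with this analyzer and then calling PrintOffsets for the term "a" should report the zero-based offsets 0 and 7, which correspond to the 1-based positions 1 and 8 above.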