Questions like this have been asked many times (e.g. here, here, here, ...), and my inability to get what I need from those answers may just come down to my not understanding what Lucene means by "term" or "termdoc".

I build a Lucene index thus:

var db = new DataClassesDataContext();
var articles = (from article in db.Articles
                orderby article.articleID ascending
                select article).ToList();
var analyzer = new StandardAnalyzer(Version.LUCENE_30);
using (var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
{
    foreach (var article in articles)
    {
        var luceneDocument = new Document();
        luceneDocument.Add(new Field("ArticleID", article.articleID.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
        luceneDocument.Add(new Field("Title", article.title, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        luceneDocument.Add(new Field("Paragraph", article.paragraph, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        writer.AddDocument(luceneDocument);
    }
    Console.WriteLine("Optimizing index.");
    writer.Optimize();
}

This works well, and I can retrieve any term frequency vector. For example,

var titleVector = indexReader.GetTermFreqVector(5001, "Title");

gives the result {Title: doing/1, healthcare/1, right/1}. But I would like to enumerate the inverted index that maps words (like "doing", "healthcare", and "right") to the IDs of the documents whose titles contain each word. I would like to build a CSV file where each row looks something like: word, ArticleID_1, ArticleID_2, ... , ArticleID_n

What I have so far doesn't do this; it prints the terms from every field, not just Title:

var terms = indexReader.Terms();
while (terms.Next())
{
    Console.WriteLine(terms.Term.Text);
}

How do I get the list of all words that the index is using as terms from the "Title" field in my documents? I.e. how do I restrict that last code snippet to Title field terms only?

Be advised that Lucene document IDs are not persistent and can/will change between index runs. If you require a reference ID that will remain consistent, you'll need to feed it to Lucene and maintain it as your index evolves. - M.Babcock
Thanks @M.Babcock. I'm passing in ArticleID as a document field, so I'll use that. - dumbledad
Hmm, the performance of this would be pretty slow for a large index. At some point you are reading from disk when you load the document with the index reader. It would have been nice if there were a way to get the current value without needing to load the document or read from disk. - Peter

1 Answer

Typical: no sooner had I written down the question than an answer came to me!

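var terms = indexReader.Terms();
while (terms.Next())
{
    // Terms() enumerates every field, so keep only terms from "Title".
    if (terms.Term.Field == "Title")
    {
        var row = "\"" + terms.Term.Text + "\", ";
        // TermDocs lists the internal IDs of the documents containing this term.
        var termDocs = indexReader.TermDocs(terms.Term);
        while (termDocs.Next())
        {
            // Map the internal document ID to the stored ArticleID field.
            row += indexReader[termDocs.Doc].Get("ArticleID") + ", ";
        }
        // Strings are immutable: TrimEnd returns a new string, so assign the result.
        row = row.TrimEnd(new char[] { ',', ' ' });
        titleFile.WriteLine(row);
    }
}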
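
The loop above visits every term in the index and filters by field. A possible refinement is to seek straight to the first Title term with Terms(Term) and stop as soon as the field changes, since terms are ordered by field and then by text. This is only a sketch: it assumes Lucene.Net 3.0.3, where TermEnum and TermDocs are disposable, and the names TitleIndexExport and WriteFieldTerms are hypothetical, as is passing the output in as a TextWriter.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Index;

public static class TitleIndexExport
{
    // Hypothetical helper: writes one CSV row per term of the given field,
    // in the form "term", ArticleID_1, ArticleID_2, ...
    public static void WriteFieldTerms(IndexReader indexReader, string field, TextWriter output)
    {
        // Terms(Term) returns an enumerator already positioned on the first term
        // at or after the given one, so read the current term BEFORE calling Next().
        using (var terms = indexReader.Terms(new Term(field, "")))
        {
            do
            {
                var term = terms.Term;
                // A null term means the enumeration is exhausted; a different
                // field means we have moved past the last term of this field.
                if (term == null || term.Field != field)
                    break;

                var ids = new List<string>();
                using (var termDocs = indexReader.TermDocs(term))
                {
                    while (termDocs.Next())
                    {
                        // Assumes ArticleID was stored (Field.Store.YES), as in the question.
                        ids.Add(indexReader[termDocs.Doc].Get("ArticleID"));
                    }
                }

                // Joining the IDs avoids having to trim a trailing separator.
                output.WriteLine("\"" + term.Text + "\", " + string.Join(", ", ids.ToArray()));
            } while (terms.Next());
        }
    }
}
```

It could be called as TitleIndexExport.WriteFieldTerms(indexReader, "Title", titleFile). Each row still loads a stored document per posting to read ArticleID, so the disk-access concern raised in the comments applies here too.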