Questions like this have been asked many times (e.g. here, here, here, ...), and my inability to get what I need from those answers may just be me not understanding what Lucene means by "term" or "termdoc".
I build a Lucene index thus:
    var db = new DataClassesDataContext();
    var articles = (from article in db.Articles
                    orderby article.articleID ascending
                    select article).ToList();
    var analyzer = new StandardAnalyzer(Version.LUCENE_30);
    using (var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
    {
        foreach (var article in articles)
        {
            var luceneDocument = new Document();
            luceneDocument.Add(new Field("ArticleID", article.articleID.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
            luceneDocument.Add(new Field("Title", article.title, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
            luceneDocument.Add(new Field("Paragraph", article.paragraph, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
            writer.AddDocument(luceneDocument);
        }
        Console.WriteLine("Optimizing index.");
        writer.Optimize();
    }
This works well, and I can retrieve any term frequency vector. For example,
    var titleVector = indexReader.GetTermFreqVector(5001, "Title");
gives the result {Title: doing/1, healthcare/1, right/1}. But I would like to enumerate the inverted index that maps words (like "doing", "healthcare", and "right") to the IDs of the documents whose titles contain each word. I would like to build a CSV file where each row is something like: word, ArticleID_1, ArticleID_2, ... , ArticleID_n
What I have so far doesn't work (it prints every term in the index, regardless of field):
    var terms = indexReader.Terms();
    while (terms.Next())
    {
        Console.WriteLine(terms.Term.Text);
    }
How do I get the list of all words that the index is using as terms from the "Title" field in my documents? I.e. how do I restrict that last code snippet to Title field terms only?
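One possible approach, sketched here under assumptions and not tested against this index: Lucene.NET 3.0 orders terms by field and then by text, so seeking the term enumeration to the first "Title" term and stopping when the field changes should visit only Title terms. The class name TitleIndexExport and the method WriteTitleTermsCsv are hypothetical names chosen for illustration, and the sketch assumes every document stores an "ArticleID" field as in the indexing code above.

```csharp
using System.Collections.Generic;
using System.IO;
using Lucene.Net.Index;

// Hypothetical helper (not part of Lucene.NET): writes one CSV row per Title term.
static class TitleIndexExport
{
    public static void WriteTitleTermsCsv(IndexReader indexReader, TextWriter output)
    {
        // Terms(Term) positions the enumeration at the first term >= ("Title", ""),
        // so the current term must be read BEFORE the first call to Next().
        using (var terms = indexReader.Terms(new Term("Title", "")))
        {
            do
            {
                var term = terms.Term;

                // Terms are sorted by field first, so leaving "Title" means we are done.
                if (term == null || term.Field != "Title")
                    break;

                // Collect the stored ArticleID of every document containing this term.
                var articleIds = new List<string>();
                using (var termDocs = indexReader.TermDocs(term))
                {
                    while (termDocs.Next())
                    {
                        // termDocs.Doc is Lucene's internal document number,
                        // so map it back to the stored ArticleID field.
                        articleIds.Add(indexReader.Document(termDocs.Doc).Get("ArticleID"));
                    }
                }

                // Note: no CSV escaping is done; a term containing a comma would need quoting.
                output.WriteLine(term.Text + "," + string.Join(",", articleIds));
            } while (terms.Next());
        }
    }
}
```

It could be called as TitleIndexExport.WriteTitleTermsCsv(indexReader, Console.Out), or with a StreamWriter to produce the file. Loading each stored document just to read ArticleID may be slow on a large index; caching the docId-to-ArticleID mapping up front would be one way to reduce that cost.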