What is docID in IndexReader.getTermVector(int docID, String field) in Lucene 8.5.1 and how does it work?

Question

I am trying to get all the terms and their postings (a Terms instance) for a field of a Lucene document (i.e. How to calculate term frequency in Lucene?). According to the documentation, there is a method for that:

public final Terms getTermVector(int docID, String field) throws IOException

Retrieve term vector for this document and field, or null if term vectors were not indexed. The returned Fields instance acts like a single-document inverted index (the docID will be 0).

The method takes a parameter called int docID. What is it? For a given document, what is its ID, and how does Lucene determine it? Following Lucene's documentation, I used a StringField as the id, and it is not an int.

import org.apache.lucene.document.*;
Document doc = new Document();
Field idField = new StringField("id",post.Id,Field.Store.YES);
Field bodyField = new TextField("body", post.Body, Field.Store.YES);
doc.add(idField);
doc.add(bodyField);

I have five questions:

How does Lucene know that the id field is used as the docID for this document? Does Lucene do that at all?
I used a String for the id, but this method takes an int. Does that cause a problem?
Is there an appropriate method to get postings?
I have used TextField. Is there any way to retrieve the term vector (Terms) of that field? I don't want to re-index my doc as explained here, because it is too large (35 GB).
Is there any way to get the term count and the frequency of each term from a TextField?

Hamed Sanaei · Accepted Answer · 2020-06-18T07:05:37

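To calculate term frequency, we can use IndexReader.getTermVector(int docID, String field). The int docID parameter is the internal document number that Lucene assigns to each document in the index; it is not the id field you stored yourself. You can obtain a docID from search hits with the following code: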

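import java.nio.file.Paths;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.*;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;

String index = "index/AIndex/";
String query = "the query text";
int requiredHits = 10; // assumed value: maximum number of hits to fetch

IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(index)));
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer();

QueryParser parser = new QueryParser("docField", analyzer);
Query lQuery = parser.parse(query);

TopDocs results = searcher.search(lQuery, requiredHits);
ScoreDoc[] hits = results.scoreDocs;
int numTotalHits = (int) results.totalHits.value;
int start = 0; // assumed value: index of the first hit to process
int end = Math.min(numTotalHits, hits.length); // hits holds at most requiredHits entries

for (int i = start; i < end; i++)
 {
   int docID = hits[i].doc; // Lucene's internal document number
   // Returns null if term vectors were not indexed for this field
   Terms termVector = reader.getTermVector(docID, "docField");
 }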

Each termVector object holds the terms of one document field together with their frequencies. You can read them with the following code:

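import java.util.HashMap;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

HashMap<String,Long> termsFrequency = new HashMap<>();
TermsEnum itr = termVector.iterator();
long allTermFrequency = 0; // long, because totalTermFreq() returns long
BytesRef term;

while ((term = itr.next()) != null){
  String termText = term.utf8ToString();
  // A term vector covers a single document, so this is the frequency in that document
  long tf = itr.totalTermFreq();
  termsFrequency.put(termText, tf);
  allTermFrequency += tf;
}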
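
Once the map is filled, the term count and each term's share of the field can be derived from it with plain Java. The following is only a sketch: the class TermFrequencyStats and its method names are made up for illustration and are not part of Lucene.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper (not a Lucene class) operating on a term -> frequency map
public class TermFrequencyStats {

    /** Sums all per-term counts, i.e. the number of term occurrences in the field. */
    public static long totalOccurrences(Map<String, Long> termsFrequency) {
        long total = 0;
        for (long tf : termsFrequency.values()) {
            total += tf;
        }
        return total;
    }

    /** Divides each term's count by the total; returns an empty map if the total is zero. */
    public static Map<String, Double> relativeFrequencies(Map<String, Long> termsFrequency) {
        long total = totalOccurrences(termsFrequency);
        Map<String, Double> relative = new HashMap<>();
        if (total == 0) {
            return relative;
        }
        for (Map.Entry<String, Long> e : termsFrequency.entrySet()) {
            relative.put(e.getKey(), e.getValue() / (double) total);
        }
        return relative;
    }
}
```

The number of distinct terms is simply termsFrequency.size().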

Note: Don't forget to enable term vector storage, as I explained here (or this one), when you index documents. If a document was indexed without term vectors, getTermVector will return null. All of Lucene's predefined Field types have this option disabled by default, so you need to enable it explicitly.
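
A minimal sketch of enabling term vectors at index time, assuming Lucene 8.x with lucene-core on the classpath. The class name TermVectorSketch, the method countDistinctTerms, the in-memory ByteBuffersDirectory and the field names "id" and "body" are illustrative choices, not part of the original answer. It also shows one way to get from a String id field to Lucene's internal docID, using a TermQuery.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

// Hypothetical demo class: indexes one document and reads its term vector back
public class TermVectorSketch {

    public static long countDistinctTerms(String id, String body) {
        // Same settings as a stored TextField, plus term vectors
        FieldType withVectors = new FieldType(TextField.TYPE_STORED);
        withVectors.setStoreTermVectors(true);
        withVectors.freeze();

        try (Directory dir = new ByteBuffersDirectory()) {
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(dir, config)) {
                Document doc = new Document();
                doc.add(new StringField("id", id, Field.Store.YES));
                doc.add(new Field("body", body, withVectors));
                writer.addDocument(doc);
            } // closing the writer commits the document

            try (IndexReader reader = DirectoryReader.open(dir)) {
                // Map the application-level String id to Lucene's internal docID
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hit = searcher.search(new TermQuery(new Term("id", id)), 1);
                if (hit.scoreDocs.length == 0) {
                    throw new IllegalStateException("no document with id " + id);
                }
                int docID = hit.scoreDocs[0].doc;

                Terms vector = reader.getTermVector(docID, "body");
                if (vector == null) {
                    throw new IllegalStateException("term vectors were not indexed");
                }
                // Count the distinct terms in this document's field
                long distinct = 0;
                TermsEnum itr = vector.iterator();
                while (itr.next() != null) {
                    distinct++;
                }
                return distinct;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Term vectors are written when a document is indexed, so a field that was indexed without them has to be re-indexed with a FieldType like the one above before getTermVector returns anything for it.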
