I am trying to get all the terms and their postings (exposed as Terms) from a field of a Lucene document (see "How to calculate term frequency in Lucene?"). According to the documentation, there is a method for that:
public final Terms getTermVector(int docID, String field) throws IOException
Retrieve term vector for this document and field, or null if term vectors were not indexed. The returned Fields instance acts like a single-document inverted index (the docID will be 0).
The method takes a parameter int docID. What is this? For a given document, what is its docID, and how does Lucene determine it?
Following Lucene's documentation, I used a StringField for the id, so it is not an int.
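import org.apache.lucene.document.*;

// post is my own object; post.Id and post.Body are Strings
Document doc = new Document();
// StringField: indexed as a single token, not analyzed
Field idField = new StringField("id", post.Id, Field.Store.YES);
// TextField: analyzed (tokenized) full text
Field bodyField = new TextField("body", post.Body, Field.Store.YES);
doc.add(idField);
doc.add(bodyField);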
I have five questions:
- How does Lucene know that the id field is used as the docID for this document? Does Lucene do this at all?
- I used a String for the id, but this method takes an int. Does that cause a problem?
- Is there an appropriate method to get the postings?
- I have used a TextField. Is there any way to retrieve the term vector (Terms) of that field? I don't want to re-index my documents as explained here, because the index is too large (35 GB).
- Is there any way to get the number of terms and the frequency of each term from a TextField?
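For concreteness, the result asked about in the last question amounts to a per-document map from each term to its frequency, which is roughly what the quoted documentation calls a "single-document inverted index". A minimal plain-Java sketch of that idea follows. It does not use Lucene: the class name TermVectorSketch is hypothetical, and the lowercase-and-split tokenization is only an assumption standing in for whatever Analyzer the real index uses.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; this is NOT Lucene's API.
class TermVectorSketch {

    // Maps each term of one document to its within-document frequency.
    // Lowercasing and splitting on non-letter/non-digit characters is an
    // assumption standing in for the Analyzer used at index time.
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> freqs = new TreeMap<>(); // sorted by term
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) { // split can yield an empty leading token
                freqs.merge(token, 1, Integer::sum);
            }
        }
        return freqs;
    }

    public static void main(String[] args) {
        Map<String, Integer> freqs =
                termFrequencies("Lucene indexes terms; terms map to postings.");
        System.out.println("distinct terms: " + freqs.size());
        freqs.forEach((term, tf) -> System.out.println(term + " -> " + tf));
    }
}
```

The size of the map corresponds to the terms count, and each value to the term frequency, for one body field.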