Is there a way to read the term vector of a document along with the positions of each term?

When creating the index, I enable term vectors with positions, along with frequencies and offsets in the postings:

        FieldType fieldType = new FieldType();
        fieldType.setStoreTermVectors(true);
        fieldType.setStoreTermVectorPositions(true);
        fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        fieldType.setStored(true);

When reading the index, I get the term vector using:

        Terms termVector = indexReader.getTermVector(docId, "content");
        TermsEnum termsEnum = termVector.iterator();

The termsEnum seems to be unpositioned, and I am not sure how to get the position values for each term of a document.

I would appreciate any help with this.

1 Answer

I think TermPositionVector and a small downcast may solve your problem. My Lucene version is 3.6.2, and the following code is written in Scala.

Assume one document has "we are family we not love" in its contents field and the query matches that document. We can then read every term together with its positions.

    // "some query" is a placeholder for whatever query matches the document
    val topDocs = iSearch.search("some query", 1).scoreDocs.toList

    topDocs.foreach { matched =>

        val termVectors = indexReader.getTermFreqVector(matched.doc, "contents")
        // The field was added to the document with TermVector.WITH_POSITIONS_OFFSETS,
        // otherwise this cast fails; wrap it in try..catch to make it more robust
        val tpvector = termVectors.asInstanceOf[TermPositionVector]

        val termAndPosition = termVectors.getTerms.toList.map { term =>
            val indexOfTerm = termVectors.indexOf(term)

            // getTermPositions returns the array of positions at which the term occurs
            term -> tpvector.getTermPositions(indexOfTerm).toList
        }

        // Map(family -> List(2), love -> List(5), we -> List(0, 3))
        println(termAndPosition.toMap)

    }

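The terms "are" and "not" are omitted during indexing because they are stop words (assuming an analyzer that filters stop words, such as StandardAnalyzer in 3.6). The positions they occupied are still counted, which is why the returned map makes sense: "we" appears at positions 0 and 3, "family" at 2, and "love" at 5. If you also want the offsets, use the getOffsets method of TermPositionVector.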
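
To make the position numbering concrete, here is a small sketch in plain Java that does not use Lucene at all. It only imitates what a stop-word-filtering analyzer does to positions: a removed token still consumes its slot. The class name `PositionSketch`, the method `positions`, and the two-word stop list are made up for illustration; a real analyzer uses a longer list and a smarter tokenizer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

class PositionSketch {
    // Hypothetical stop list, just large enough for the example sentence.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("are", "not"));

    // Maps each surviving term to the token positions where it occurs.
    static Map<String, List<Integer>> positions(String text) {
        Map<String, List<Integer>> result = new TreeMap<>();
        String[] tokens = text.toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            if (STOP_WORDS.contains(tokens[pos])) {
                continue; // the stop word is dropped, but its position is still used up
            }
            result.computeIfAbsent(tokens[pos], k -> new ArrayList<>()).add(pos);
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {family=[2], love=[5], we=[0, 3]}
        System.out.println(positions("we are family we not love"));
    }
}
```

The gaps at positions 1 and 4 are where "are" and "not" used to be, which matches the map printed by the Scala code.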

Anyway, hope it helps.
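
The code above is for 3.6.2. The question's `Terms` / `TermsEnum` code looks like Lucene 5 or later, where I believe the equivalent is to ask the `TermsEnum` for a `PostingsEnum` with the `POSITIONS` flag. The following is an untested sketch under that assumption. The class name `TermVectorPositions` and the method `read` are mine, and in the newest releases `getTermVector` is deprecated or removed in favour of `termVectors()`, so adjust it to your version.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

final class TermVectorPositions {

    // Returns term -> positions for one field of one document (hypothetical helper).
    static Map<String, List<Integer>> read(IndexReader reader, int docId, String field)
            throws IOException {
        Map<String, List<Integer>> result = new LinkedHashMap<>();

        Terms termVector = reader.getTermVector(docId, field);
        if (termVector == null) {
            return result; // no term vector was stored for this field
        }

        TermsEnum termsEnum = termVector.iterator();
        PostingsEnum postings = null;
        BytesRef term;
        while ((term = termsEnum.next()) != null) {
            // Request positions explicitly; the enum is reused between terms.
            postings = termsEnum.postings(postings, PostingsEnum.POSITIONS);
            postings.nextDoc(); // a term vector contains exactly one document

            int freq = postings.freq();
            List<Integer> positions = new ArrayList<>(freq);
            for (int i = 0; i < freq; i++) {
                positions.add(postings.nextPosition());
            }
            result.put(term.utf8ToString(), positions);
        }
        return result;
    }
}
```

Calling `TermVectorPositions.read(indexReader, docId, "content")` should give a map like the one above, provided the field was indexed with `setStoreTermVectorPositions(true)`.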