The example in this question, and some others I've seen on the web, use the postings method of a TermsEnum obtained from a term vector to get term positions. Copied from the example in the linked question:

IndexReader ir = obtainIndexReader();
Terms tv = ir.getTermVector( doc, field );
TermsEnum terms = tv.iterator();
PostingsEnum p = null;
while( terms.next() != null ) {
    p = terms.postings( p, PostingsEnum.ALL );
    while( p.nextDoc() != PostingsEnum.NO_MORE_DOCS ) {
        int freq = p.freq();
        for( int i = 0; i < freq; i++ ) {
            int pos = p.nextPosition();   // Always returns -1!!!
            BytesRef data = p.getPayload();
            doStuff( freq, pos, data ); // Fails miserably, of course.
        }
    }
}

This code works for me, but what drives me mad is that the Terms type is supposedly where the position information is kept. All the documentation I've seen keeps saying that term vectors hold position data. However, there are no methods on this type to get that information!

Older versions of Lucene apparently had a method for this, but as of at least version 6.5.1 that is no longer the case.

Instead I'm supposed to use the postings method and traverse the documents, even though I already know which document I want to work on!

The API documentation does not say anything about postings returning only the current document (the one the term vector belongs to) but when I run it, I only get the current doc.

Is this the correct and only way to get position data from term vectors? Why such an unintuitive API? Is there a document that explains why the previous approach changed in favour of this?

1 Answer

I don't know about "right or wrong", but with Lucene 6.6.3 this seems to work:

private void run() throws Exception {
    Directory directory = new RAMDirectory();
    IndexWriterConfig indexWriterConfig = new IndexWriterConfig(new StandardAnalyzer());
    IndexWriter writer = new IndexWriter(directory, indexWriterConfig);

    Document doc = new Document();
    // Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES
    FieldType type = new FieldType();
    type.setStoreTermVectors(true);
    type.setStoreTermVectorPositions(true);
    type.setStoreTermVectorOffsets(true);
    type.setIndexOptions(IndexOptions.DOCS);

    Field fieldStore = new Field("tags", "foo bar and then some", type);
    doc.add(fieldStore);
    writer.addDocument(doc);
    writer.close();

    DirectoryReader reader = DirectoryReader.open(directory);
    IndexSearcher searcher = new IndexSearcher(reader);

    Term t = new Term("tags", "bar");
    Query q = new TermQuery(t);
    TopDocs results = searcher.search(q, 1);

    for ( ScoreDoc scoreDoc: results.scoreDocs ) {
        Fields termVs = reader.getTermVectors(scoreDoc.doc);
        Terms f = termVs.terms("tags");
        TermsEnum te = f.iterator();
        PostingsEnum docsAndPosEnum = null;
        BytesRef bytesRef;
        while ( (bytesRef = te.next()) != null ) {
            docsAndPosEnum = te.postings(docsAndPosEnum, PostingsEnum.ALL);
            // for each term (iterator next) in this field (field)
            // iterate over the docs (should only be one)
            int nextDoc = docsAndPosEnum.nextDoc();
            assert nextDoc != DocIdSetIterator.NO_MORE_DOCS;
            final int fr = docsAndPosEnum.freq();
            // a term may occur more than once, so read every position rather than only the first
            for ( int i = 0; i < fr; i++ ) {
                final int p = docsAndPosEnum.nextPosition();
                final int o = docsAndPosEnum.startOffset();
                System.out.println("p="+ p + ", o=" + o + ", l=" + bytesRef.length + ", f=" + fr + ", s=" + bytesRef.utf8ToString());
            }
        }
    }
}
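
If the traversal feels awkward to repeat, it can be hidden behind a small helper. The following is only a sketch, assuming Lucene 6.x (where IndexReader.getTermVector(int, String) is available) and a field indexed with term vector positions; the class name TermVectorPositions and the method positionsOf are hypothetical names made up for illustration, not part of Lucene.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.BytesRef;

// Hypothetical helper (not a Lucene class): collects every position of every
// term in one document's term vector for a single field.
public final class TermVectorPositions {

    private TermVectorPositions() {}

    public static Map<String, List<Integer>> positionsOf(
            IndexReader reader, int docId, String field) throws IOException {
        Map<String, List<Integer>> result = new LinkedHashMap<>();

        // null when the document has no term vector for this field
        Terms vector = reader.getTermVector(docId, field);
        if (vector == null || !vector.hasPositions()) {
            return result;
        }

        TermsEnum terms = vector.iterator();
        PostingsEnum postings = null;
        BytesRef term;
        while ((term = terms.next()) != null) {
            postings = terms.postings(postings, PostingsEnum.POSITIONS);
            // the enum has to be advanced onto its document before freq()
            // and nextPosition() may be called
            if (postings.nextDoc() == DocIdSetIterator.NO_MORE_DOCS) {
                continue;
            }
            int freq = postings.freq();
            List<Integer> positions = new ArrayList<>(freq);
            for (int i = 0; i < freq; i++) {
                positions.add(postings.nextPosition());
            }
            result.put(term.utf8ToString(), positions);
        }
        return result;
    }
}
```

In the loop above over results.scoreDocs, a call such as TermVectorPositions.positionsOf(reader, scoreDoc.doc, "tags") would then replace the manual TermsEnum/PostingsEnum handling. The nextDoc() call is kept inside the helper because the enum still has to be positioned on its document, even when that document is the only one it will ever return.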