1
votes

I am migrating my code from Lucene 3.5 to Lucene 4.1 but I am having some problems with getting the term vector without indexing.

The problem is, given a text string and an Analyzer, I need to compute the term vector (technically, find the terms and their frequencies tf). Obviously, it can be achieved by writing the index (using IndexWriter) and then reading them back (using IndexReader) but I reckon it would be expensive. Furthermore, I don't need document frequency (df). Thus, I think an indexing-free solution is suitable.

In Lucene 2 and 3, a simple technique for the above purpose is to use QueryTermVector which extends TermFreqVector and has a constructor taking a string and an Analyzer. Unfortunately, QueryTermVector (along with TermFreqVector) has been removed in Lucene 4 and it seems the migration documentation did not mention anything about QueryTermVector.

Do you have a solution for this problem in Lucene 4? Thank you very much.

1

1 Answers

0
votes

If you just need to know the terms/frequencies, you can just obtain the single tokens directly from the analyzer (you can get the TF by counting them, e.g. by using a Map or a Multiset).

This is how you do it in Lucene 4.0:

TokenStream ts = analyzer.tokenStream(field, new StringReader(text));
CharTermAttribute charTermAttribute = ts.addAttribute(CharTermAttribute.class);


while (ts.incrementToken()) {
    String term = charTermAttribute.toString();
    //term contains your token
}