
From what I understand, the demo IndexFiles example in the Lucene contributions directory will create an inverted index from document terms to the corresponding document pathnames.

I was wondering if there was a way to add the term frequency in each document to the index as well.

In other words (if I understand this right), the original mapping, term -> list of (document pathname), would become term -> list of (document pathname, term frequency in that document).
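To make the target mapping concrete, here is a minimal plain-Java sketch of the structure described above. It does not use Lucene and is not how Lucene lays out its index; the class name `TermFrequencySketch`, the `build` method, the sample pathnames, and the naive tokenizer are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Lucene: builds term -> (pathname -> term frequency).
class TermFrequencySketch {

    // docs maps a document pathname to its already-loaded text (an assumption of this sketch).
    static Map<String, Map<String, Integer>> build(Map<String, String> docs) {
        Map<String, Map<String, Integer>> index = new TreeMap<>();
        for (Map.Entry<String, String> doc : docs.entrySet()) {
            // Naive tokenization: lowercase, then split on runs of non-letter, non-digit characters.
            for (String term : doc.getValue().toLowerCase().split("[^\\p{L}\\p{N}]+")) {
                if (term.isEmpty()) {
                    continue; // split() yields an empty first token when the text starts with a delimiter
                }
                // Create the per-term postings map on first sight, then bump this document's count.
                index.computeIfAbsent(term, t -> new LinkedHashMap<>())
                     .merge(doc.getKey(), 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> docs = new LinkedHashMap<>();
        docs.put("docs/a.txt", "the cat saw the hat");
        docs.put("docs/b.txt", "a cat");
        // Print each term followed by its (pathname -> frequency) postings.
        build(docs).forEach((term, postings) -> System.out.println(term + " -> " + postings));
    }
}
```

Each term maps to the documents that contain it, and each of those documents carries its own count, so a lookup such as `index.get("the").get("docs/a.txt")` returns the frequency without reopening the file.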

Is there a way to achieve this? Currently, I count the term frequencies on the fly by opening each document pathname in Java and counting the terms. This adds a lot of overhead, since there are potentially hundreds of documents to open and process.


1 Answer


Lucene generally does store the term frequencies, and can also store the term offsets and positions. The frequency information is stored in a file with the extension ".frq", so if your index folder contains such a file, term frequencies are being stored.

You didn't say why you care, or what you want to do with the frequencies. Usually Lucene uses them to compute relevance scores for your queries. If you want the raw frequencies, this other question discusses how to retrieve them: Get term frequencies in Lucene