I am trying to calculate the total number of times a particular term occurs across an entire index (the term's collection frequency). I have tried to do this with term vectors; however, the term vectors API operates on a single document. Even for terms that do occur in the specified document, the statistics in the response seem to max out at a certain doc_count (within field_statistics), which makes me doubt their accuracy.
Request:
http://myip:9200/clinicaltrials/trial/AVmk-ky6XMskTDwIwpih/_termvectors?term_statistics=true
The document id being used here is "AVmk-ky6XMskTDwIwpih", although the term statistics should not be specific to a document.
Response:
This is what I get for the term "cancer" for one of the fields:
    "cancer" : {
      "doc_freq" : 5297,
      "ttf" : 10587,
      "term_freq" : 1,
      "tokens" : [
        {
          "position" : 15,
          "start_offset" : 115,
          "end_offset" : 121
        }
      ]
    },
If I total the ttf for "cancer" across all fields, I get 18915. However, the actual total term frequency for "cancer" is 542829. This leads me to believe that the API is limiting the term vector statistics to a subset of the documents in the index.
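For clarity, the totalling across fields can be sketched roughly as below. This is only an illustration: it assumes the _termvectors response has been parsed into a Python dict with the usual "term_vectors" -> field -> "terms" layout, and the function name total_ttf, the field names, and the numbers in the sample are made up for the example, not taken from my index.

```python
def total_ttf(termvectors_response, term):
    """Sum the ttf reported for `term` across every field in a parsed _termvectors response."""
    total = 0
    # "term_vectors" maps each field name to its statistics and terms.
    for field_data in termvectors_response.get("term_vectors", {}).values():
        term_entry = field_data.get("terms", {}).get(term)
        if term_entry is not None:
            # "ttf" is only present when term_statistics=true was requested.
            total += term_entry.get("ttf", 0)
    return total


# Hypothetical response fragment (field names and values are placeholders).
sample = {
    "term_vectors": {
        "title": {"terms": {"cancer": {"doc_freq": 2, "ttf": 3, "term_freq": 1}}},
        "summary": {"terms": {"cancer": {"doc_freq": 3, "ttf": 4, "term_freq": 2}}},
        "sponsor": {"terms": {"university": {"doc_freq": 1, "ttf": 1, "term_freq": 1}}},
    }
}

print(total_ttf(sample, "cancer"))
```

Fields in which the term does not appear for this document contribute nothing to the sum, since a term vector only lists the terms of the requested document.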
Any advice here would be greatly appreciated.