15
votes

I am trying to calculate the total number of times a particular term occurs across an entire index (the term's collection frequency). I tried using term vectors, but that API operates on a single document. Even for terms that exist in the specified document, the reported statistics seem to max out at a certain doc_count (within field_statistics), which makes me doubt their accuracy.

Request:

http://myip:9200/clinicaltrials/trial/AVmk-ky6XMskTDwIwpih/_termvectors?term_statistics=true

The document id being used here is "AVmk-ky6XMskTDwIwpih", although the term statistics should not be specific to a document.

Response:

This is what I get for the term "cancer" for one of the fields:

 "cancer" : {
      "doc_freq" : 5297,
      "ttf" : 10587,
      "term_freq" : 1,
      "tokens" : [
        {
          "position" : 15,
          "start_offset" : 115,
          "end_offset" : 121
        }
      ]
    },

If I total the ttf for all fields, I get 18915. However, the actual total term frequency for "cancer" is in fact 542829. This leads me to believe that it is limiting the term_vector stats to a subset of documents within the index.
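For reference, the per-field totalling described above can be sketched in Python roughly as follows. The function name and the field names are hypothetical, and the second field's numbers are invented purely so that the sum matches the 18915 quoted above (the real per-field breakdown is not given here); only the "cancer" entry with ttf 10587 comes from the actual response.

```python
def total_ttf(response, term):
    """Sum the `ttf` of `term` over every field of a _termvectors response."""
    total = 0
    for field_data in response.get("term_vectors", {}).values():
        stats = field_data.get("terms", {}).get(term)
        if stats is not None:
            total += stats.get("ttf", 0)
    return total


# Illustrative response fragment; field names and the second entry are made up.
sample = {
    "term_vectors": {
        "brief_title": {
            "terms": {"cancer": {"doc_freq": 5297, "ttf": 10587, "term_freq": 1}}
        },
        "detailed_description": {
            "terms": {"cancer": {"doc_freq": 4100, "ttf": 8328, "term_freq": 2}}
        },
        "condition": {
            "terms": {"trial": {"doc_freq": 900, "ttf": 1200, "term_freq": 1}}
        },
    }
}

print(total_ttf(sample, "cancer"))  # 18915
```

Fields in which the term does not appear are simply skipped, so the result is the sum over the fields that do contain it.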

Any advice here would be greatly appreciated.

3
What Elasticsearch version are you using? (xecgr)

3 Answers

6
votes

The counts differ because term vector statistics are only accurate when the index has a single shard. In a multi-shard index the documents are distributed across the shards, and the statistics are retrieved only from the shard that holds the requested document (for artificial documents, a shard is selected at random). The frequency returned is therefore that shard's count, not the index-wide total.

The returned frequency is thus only a relative measure, not the absolute value you expect; see the Behaviour section of the term vectors documentation. To test this, create a single-shard index and request the frequency there (it should give you the actual total).
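As a rough sketch of that test, the request that creates a single-shard index could be built in Python like this. The index name `clinicaltrials_single_shard` and the helper name are assumptions for illustration; `number_of_shards` is the standard index setting. The request is only constructed here, not sent.

```python
import json
import urllib.request


def single_shard_index_request(base_url, index_name):
    """Build (but do not send) a PUT request that creates a one-shard index."""
    body = {"settings": {"index": {"number_of_shards": 1}}}
    return urllib.request.Request(
        url=f"{base_url}/{index_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical index name; the host is the one used in the question.
req = single_shard_index_request("http://myip:9200", "clinicaltrials_single_shard")
print(req.get_method(), req.full_url)
# To actually create the index: urllib.request.urlopen(req)
```

The documents would then need to be indexed (or reindexed) into the new index before requesting term vectors against it.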

6
votes

I believe you need to set term_statistics to true, as per the Elasticsearch documentation:

Term statistics: Setting term_statistics to true (default is false) will return

total term frequency (how often a term occurs in all documents)

document frequency (the number of documents containing the current term)

By default these values are not returned since term statistics can have a serious performance impact.

-1
votes

Have you tried simply using the Count API? https://www.elastic.co/guide/en/elasticsearch/reference/7.6/search-count.html

It returns the number of documents that match a query (a document count, not the number of term occurrences), so something like this may work.

GET /my_index/_count
{
    "query" : {"match": {"my_field": "my_keyword"}}
}
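A minimal Python sketch of the same request, assuming the placeholder names above (`my_index`, `my_field`, `my_keyword`) and a made-up response body for illustration. Note that the `count` value is the number of matching documents, which is closer to doc_freq than to the total term frequency.

```python
import json


def build_count_request(index, field, keyword):
    """Return the path and JSON body for a _count request with a match query."""
    path = f"/{index}/_count"
    body = {"query": {"match": {field: keyword}}}
    return path, json.dumps(body)


def extract_count(response_text):
    """Read the `count` field (number of matching documents) from a _count response."""
    return json.loads(response_text)["count"]


path, body = build_count_request("my_index", "my_field", "my_keyword")
print(path)
print(body)

# Hypothetical response text, shaped like a _count response.
sample_response = '{"count": 5297, "_shards": {"total": 5, "successful": 5, "skipped": 0, "failed": 0}}'
print(extract_count(sample_response))  # 5297
```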