I am trying to get the total term frequency (ttf) and document count for a given set of documents, but _termvectors in Elasticsearch returns ttf and doc_count computed over all documents in the index. Is there any way to specify a list of documents (document ids) so that the result is based on those documents only?

Below are the document details and the query I use to get the total term frequency:

Index details:

PUT /twitter
{ "mappings": {
    "tweets": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer":"english"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    }
  }
}

Document Details:

PUT /twitter/tweets/1
{
  "name":"Hello bar"
}

PUT /twitter/tweets/2
{
  "name":"Hello foo"
}

PUT /twitter/tweets/3
{
  "name":"Hello foo bar"
}

This creates three documents with ids 1, 2 and 3. Now suppose the tweets with ids 1 and 2 belong to user1 and tweet 3 belongs to another user, and I want to get the term vectors for user1.

Query to get this result:

GET /twitter/tweets/_mtermvectors
{
  "ids" : ["1", "2"],
  "parameters": {
      "fields": ["name"],
      "term_statistics": true,
      "offsets":false,
      "payloads":false,
      "positions":false
  }
}

Response:

{
  "docs": [
    {
      "_index": "twitter",
      "_type": "tweets",
      "_id": "1",
      "_version": 1,
      "found": true,
      "took": 1,
      "term_vectors": {
        "name": {
          "field_statistics": {
            "sum_doc_freq": 7,
            "doc_count": 3,
            "sum_ttf": 7
          },
          "terms": {
            "bar": {
              "doc_freq": 2,
              "ttf": 2,
              "term_freq": 1
            },
            "hello": {
              "doc_freq": 3,
              "ttf": 3,
              "term_freq": 1
            }
          }
        }
      }
    },
    {
      "_index": "twitter",
      "_type": "tweets",
      "_id": "2",
      "_version": 1,
      "found": true,
      "took": 1,
      "term_vectors": {
        "name": {
          "field_statistics": {
            "sum_doc_freq": 7,
            "doc_count": 3,
            "sum_ttf": 7
          },
          "terms": {
            "foo": {
              "doc_freq": 2,
              "ttf": 2,
              "term_freq": 1
            },
            "hello": {
              "doc_freq": 3,
              "ttf": 3,
              "term_freq": 1
            }
          }
        }
      }
    }
  ]
}

Here we can see that hello has doc_freq 3 and ttf 3, and the field's doc_count is 3. How can I make it consider only the documents with the given ids?

One approach I am considering is to create a separate index for each user, but I am not sure this is the right approach, since the number of indices would grow with the number of users. Is there another solution?

1 Answer

To get per-term document counts on a subset of documents, you can use a simple aggregation.

You will have to enable fielddata in the mapping of the field (this can be heavy on memory; see the documentation page about fielddata for more details):

PUT /twitter
{ 
  "mappings": {
    "tweets": {
      "properties": {
        "name": {
          "type": "text",
          "analyzer":"english",
          "fielddata": true,
          "term_vector": "yes"
        }
      }
    }
  }
}

Then use a terms aggregation, restricted to the ids you are interested in:

POST /twitter/tweets/_search
{
  "size": 0,
  "query": {
    "terms": {
      "_id": [
        "1",
        "2"
      ]
    }
  },
  "aggs": {
    "my_term_doc_count": {
      "terms": {
        "field": "name"
      }
    }
  }
}

The response will be:

{
  "hits": ...,
  "aggregations": {
    "my_term_doc_count": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "hello",
          "doc_count": 2
        },
        {
          "key": "bar",
          "doc_count": 1
        },
        {
          "key": "foo",
          "doc_count": 1
        }
      ]
    }
  }
}

I couldn't find a way to compute the total term frequency (ttf) on a subset of documents with a query, though; I'm afraid it can't be done.
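
That said, if the subset is small enough to fetch term vectors for every document in it, the numbers can be derived on the client side. Each document in the _mtermvectors response carries its own term_freq, so summing term_freq over the requested ids gives the subset ttf, and counting the documents that contain a term gives the subset doc_freq. A minimal Python sketch, assuming the response has already been parsed into a dict (the function name subset_term_stats is made up for illustration):

```python
from collections import Counter


def subset_term_stats(mtermvectors_response, field):
    """Aggregate per-document term_freq values over the returned documents.

    Returns (doc_freq, ttf) Counters restricted to the documents in the
    response, instead of the index-wide statistics Elasticsearch reports.
    """
    doc_freq = Counter()
    ttf = Counter()
    for doc in mtermvectors_response.get("docs", []):
        # Documents that were not found or lack the field contribute no terms.
        terms = doc.get("term_vectors", {}).get(field, {}).get("terms", {})
        for term, stats in terms.items():
            doc_freq[term] += 1              # one more document contains the term
            ttf[term] += stats["term_freq"]  # occurrences inside this document
    return doc_freq, ttf


# Trimmed version of the response shown in the question (ids 1 and 2).
response = {
    "docs": [
        {"_id": "1", "term_vectors": {"name": {"terms": {
            "bar": {"term_freq": 1}, "hello": {"term_freq": 1}}}}},
        {"_id": "2", "term_vectors": {"name": {"terms": {
            "foo": {"term_freq": 1}, "hello": {"term_freq": 1}}}}},
    ]
}

doc_freq, ttf = subset_term_stats(response, "name")
print(dict(doc_freq))  # {'bar': 1, 'hello': 2, 'foo': 1}
print(dict(ttf))       # {'bar': 1, 'hello': 2, 'foo': 1}
```

This only scales as far as you are willing to pull term vectors for each document, so treat it as a workaround for small subsets rather than a general solution.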

I would suggest computing the term vectors offline with the _analyze API and storing them explicitly in a separate index. That way you can also compute the total term frequency with simple aggregations. Here is an example of using the _analyze API; passing the field name makes it use that field's analyzer (english here) instead of the index default:

POST twitter/_analyze
{
  "field": "name",
  "text": "Hello foo bar"
}

{
  "tokens": [
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "foo",
      "start_offset": 6,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "bar",
      "start_offset": 10,
      "end_offset": 13,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}
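
To illustrate the offline approach, here is a minimal Python sketch that turns an _analyze response into one record per (document, term) pair, ready to be indexed separately. The record fields (user, doc_id, term, term_freq), the function name term_records and the user name "user2" are assumptions made for illustration, not an Elasticsearch convention:

```python
from collections import Counter


def term_records(user, doc_id, analyze_response):
    """Build one record per distinct token of a single analyzed document."""
    # Count how many times each token text occurs in the document.
    counts = Counter(tok["token"] for tok in analyze_response["tokens"])
    return [
        {"user": user, "doc_id": doc_id, "term": term, "term_freq": freq}
        for term, freq in counts.items()
    ]


# _analyze output for document 3 ("Hello foo bar"), trimmed to the token text.
analyzed = {"tokens": [{"token": "hello"}, {"token": "foo"}, {"token": "bar"}]}

records = term_records("user2", "3", analyzed)
for record in records:
    print(record)
```

With records like these stored as documents, and term mapped as a keyword field, a terms aggregation on term with a sum sub-aggregation on term_freq, filtered by user or doc_id, should give both the document count and the ttf for any subset.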

Hope that helps!