
Below are the records in my test-data index. I am using Elasticsearch version 5.6.

[
  {
    "_index": "test-data",
    "_type": "log",
    "_id": "123",
    "_score": 2,
    "_source": {
      "request": "/test-url/poll?request_ids=1",
      "user": "test1"
    }
  },
  {
    "_index": "test-data",
    "_type": "log",
    "_id": "126",
    "_score": 2,
    "_source": {
      "request": "/test-url/poll?request_ids=2",
      "user": "test1"
    }
  },
  {
    "_index": "test-data",
    "_type": "log",
    "_id": "124",
    "_score": 2,
    "_source": {
      "request": "/test-url/poll?request_ids=2",
      "user": "test1"
    }
  },
  {
    "_index": "test-data",
    "_type": "log",
    "_id": "125",
    "_score": 2,
    "_source": {
      "request": "/test-url/poll?request_ids=2",
      "user": "test1"
    }
  },
  {
    "_index": "test-data",
    "_type": "log",
    "_id": "128",
    "_score": 2,
    "_source": {
      "request": "/test-url/poll?request_ids=2",
      "user": "test2"
    }
  }
]

I need to count the distinct combinations of request and user across these records, and tried the query below. I expect 3 as the result, but I am getting 5.

{
  "query": {
    "bool": {
      "must": [
        {
          "exists": {
            "field": "request"
          }
        },
        {
          "regexp": {
            "request.keyword": "/test-url/poll\\?request_ids=.*"
          }
        }
      ]
    }
  },
  "_source": ["request.keyword", "user.keyword", "request", "user"],
  "aggs": {
    "request_count": {
      "cardinality": {
        "script": {
          "lang": "painless",
          "source": "[doc['request.keyword'], doc['user.keyword']]"
        }
      }
    }
  }
}

Can somebody suggest what is wrong with the query, or another way to solve this?

1 Answer

I think you should try the following:

"doc['request.keyword'].value + ' ' + doc['user.keyword'].value"

With this script the cardinality aggregation hashes a single string per document, built by concatenating the two values (request and user), so each distinct request/user pair is counted once.
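
Placed into the request from the question, the concatenating script would look roughly like the sketch below. It assumes both fields are single-valued and present on every matching document. The `"size": 0` line is my addition (it only suppresses the hits) and the `_source` filter is dropped because it does not affect the aggregation.

```json
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "exists": {
            "field": "request"
          }
        },
        {
          "regexp": {
            "request.keyword": "/test-url/poll\\?request_ids=.*"
          }
        }
      ]
    }
  },
  "aggs": {
    "request_count": {
      "cardinality": {
        "script": {
          "lang": "painless",
          "source": "doc['request.keyword'].value + ' ' + doc['user.keyword'].value"
        }
      }
    }
  }
}
```

For the five sample documents the distinct pairs are (request_ids=1, test1), (request_ids=2, test1) and (request_ids=2, test2), so `request_count.value` should come back as 3. Keep in mind that cardinality is an approximate count on large data sets.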

Caveat: this can be a significant performance hit, since the script reads the field values and builds the string for every matching document at query time.

One way to avoid this is to change your indexing process so that it writes a synthetic field holding the concatenation. You can then run a normal cardinality aggregation on that field instead of a scripted one.
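
One possible way to build such a field, if you would rather not change the client code, is an ingest pipeline with a `set` processor. This is only a sketch: the field name `request_user` is hypothetical, and the pipeline has to be registered under a name of your choosing and referenced when indexing. The pipeline body could be:

```json
{
  "description": "Concatenate request and user into one field",
  "processors": [
    {
      "set": {
        "field": "request_user",
        "value": "{{request}} {{user}}"
      }
    }
  ]
}
```

Assuming the default dynamic mapping, which gives new string fields a `.keyword` sub-field, the aggregation then needs no script:

```json
{
  "size": 0,
  "aggs": {
    "request_count": {
      "cardinality": {
        "field": "request_user.keyword"
      }
    }
  }
}
```

Documents indexed before the pipeline was in place will not have the field, so they would need to be reindexed.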