I have an index that collects web redirect data for various sites. I use a nested field to hold the data, as shown in the mapping below:

"chain": {
    "type": "nested",
    "properties": {
      "url.position": {
        "type": "long"
      },
      "url.full": {
        "type": "text"
      },
      "url.domain": {
        "type": "keyword"
      },
      "url.path": {
        "type": "keyword"
      },
      "url.query": {
        "type": "text"
      }
    }
  }

As you can imagine, each document contains an array of URL chain entries, with one entry per web redirect. I want to get aggregations based on wildcard/regexp matches on the url.query field. Here is a sample query:

GET push_url_chain/_search
{
  "query": {
    "nested": {
      "path": "chain",
      "query": {
        "regexp": {
          "chain.url.query": "aff_c.*"
        }
      }
    }
  },
  "size": 0,
  "aggs": {
    "dataFields": {
      "nested": {
        "path": "chain"
      },
      "aggs": {
        "offers": {
          "terms": {
            "field": "chain.url.domain",
            "size": 30
          }
        }
      }
    }
  }
}

The above query does produce aggregated results, but not the way I want. I want to see chain.url.domain aggregations only for the URLs whose query matches aff_c.*. Right now it looks at all the URLs in the chain and aggregates the buckets by doc_count, regardless of whether that URL/domain matches the pattern. I hope I have explained this clearly. How do I get bucket aggregations that contain only the domains of URLs whose url.query field matches aff_c.*?

I would also like to know how I can use = or / in my wildcard or regexp queries. I get no results when I use those symbols in my queries.

Tha

1 Answer

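A nested query returns every top-level document in which at least one nested document matches the condition; the matching nested documents themselves are only exposed through inner_hits. The aggregation then runs over all nested documents of those top-level documents, not just the matching ones, which is why every domain in each chain shows up in the terms buckets.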
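To see this for yourself, a minimal sketch of the question's query with inner_hits enabled is shown here. It is meant as a request body for GET push_url_chain/_search; turning off _source is only an assumption made to keep the response short. Each returned hit should then carry an inner_hits.chain section listing just the chain entries that matched, while the hit itself still stands for the whole document.

```json
{
  "_source": false,
  "query": {
    "nested": {
      "path": "chain",
      "query": {
        "regexp": {
          "chain.url.query": "aff_c.*"
        }
      },
      "inner_hits": {}
    }
  }
}
```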

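You need to put a filter aggregation inside the nested aggregation so that the terms aggregation only sees the matching nested documents. In the request below, matched_doc filters the chain entries by url.query, and domain collects the terms for the URLs that passed the filter. Replace "abc" with the prefix you are looking for.

{
  "size": 0,
  "aggs": {
    "Name": {
      "nested": {
        "path": "chain"
      },
      "aggs": {
        "matched_doc": {
          "filter": {
            "match_phrase_prefix": {
              "chain.url.query": "abc"
            }
          },
          "aggs": {
            "domain": {
              "terms": {
                "field": "chain.url.domain",
                "size": 10
              }
            }
          }
        }
      }
    }
  }
}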

You can use match_phrase_prefix instead of regexp here. It usually performs better.

The standard analyzer drops characters such as "/" and "=" when it generates tokens, so they never reach the index of a text field. If you want a regexp or wildcard query to match these characters, run it against a keyword field instead of a text field.
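As a rough sketch of that last point, one option is to add a keyword sub-field to url.query and query that instead. The sub-field name raw is an assumption made for this example, and documents indexed before the mapping change would need to be reindexed (for example with update by query) before the sub-field holds any values. The body of a PUT push_url_chain/_mapping request might look like this:

```json
{
  "properties": {
    "chain": {
      "type": "nested",
      "properties": {
        "url": {
          "properties": {
            "query": {
              "type": "text",
              "fields": {
                "raw": {
                  "type": "keyword"
                }
              }
            }
          }
        }
      }
    }
  }
}
```

With the sub-field populated, the filter inside the nested aggregation could use regexp against chain.url.query.raw. A regexp query has to match the whole keyword value, hence the leading and trailing .* here; the pattern aff_c= is only an illustration of a pattern containing "=".

```json
{
  "size": 0,
  "aggs": {
    "Name": {
      "nested": {
        "path": "chain"
      },
      "aggs": {
        "matched_doc": {
          "filter": {
            "regexp": {
              "chain.url.query.raw": ".*aff_c=.*"
            }
          },
          "aggs": {
            "domain": {
              "terms": {
                "field": "chain.url.domain",
                "size": 30
              }
            }
          }
        }
      }
    }
  }
}
```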