I've been struggling with a problem for a while now, so I thought I would run it by Stack Overflow.

My document type has a title, a language field (used for filtering), and a grouping ID field. (I'm leaving out all the other fields to keep this to the point.)

When I search, I want to find all documents whose title contains the search text, but I only want one document for each unique grouping ID.

I've been looking at the top_hits aggregation, and from what I can see it should be able to solve my problem.

When running this query against my index:

{
  "query": {
    "match": {
      "title": "dingo"
    }
  },
  "aggs": {
    "top-tags": {
      "terms": {
        "field": "groupId",
        "size": 1000000
      },
      "aggs": {
        "top_tag_hits": {
          "top_hits": {
            "_source": {
              "include": [
                "*"
              ]
            },
            "size": 1
          }
        }
      }
    }
  }
}

I get the following response (All results are in the same language):

{
    "took": 9,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 3,
        "max_score": 0,
        "hits": []
    },
    "aggregations": {
        "top-tags": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [{
                "key": "3044BC9E7C29450AAB2E4B6C9B35AAE2",
                "doc_count": 2,
                "top_tag_hits": {
                    "hits": {
                        "total": 2,
                        "max_score": 1.4983996,
                        "hits": [{
                            "_index": "elasticsearch",
                            "_type": "productdocument",
                            "_id": "FB15279FB18E4B34AD66ACAF69B96E9E",
                            "_score": 1.4983996,
                            "_source": {
                                "groupId": "3044BC9E7C29450AAB2E4B6C9B35AAE2",
                                "title": "wombat, dingo and zetapunga actionfigures"
                            }
                        }]
                    }
                }
            },
            {
                "key": "F11799ABD0C14B98ADF2554C84FF0DA0",
                "doc_count": 1,
                "top_tag_hits": {
                    "hits": {
                        "total": 1,
                        "max_score": 1.30684,
                        "hits": [{
                            "_index": "elasticsearch",
                            "_type": "productdocument",
                            "_id": "42562A25E4434A0091DE0C79A3E7F3F4",
                            "_score": 1.30684,
                            "_source": {
                                "groupId": "F11799ABD0C14B98ADF2554C84FF0DA0",
                                "title": "awesome dingo raptor"
                            }
                        }]
                    }
                }
            }]
        }
    }
}

This is exactly what I expected (two hits in one bucket, but only one document retrieved for that bucket). However, when I try this in NEST, I can't seem to retrieve all of the documents.

My query looks like this:

result = _elasticClient.Search<T>(s => s
                .From(skip)
                .Filter(fd => fd.Term(f => f.Language, language))
                .Size(pageSize)
                .SearchType(SearchType.Count)
                .Query(
                    q => q.Wildcard(f => f.Title, query, 2.0)
                         || q.Wildcard(f => f.Description, query)
                )
                .Aggregations(agd =>
                    agd.Terms("groupId", tagd => tagd
                        .Field("groupId")
                        .Size(100000) //We sadly need all products
                    )
                    .TopHits("top_tag_hits", thagd => thagd
                        .Size(1)
                        .Source(ssd => ssd.Include("*")))
                ));

var topHits = result.Aggs.TopHits("top_tag_hits");
var documents = topHits.Documents<ProductDocument>(); //contains only one document (I would expect it to contain two, one for each bucket)

Inspecting the aggregations in the debugger reveals a "groupId" aggregation with 2 buckets, matching what I see in my "raw" query against the index, but without any apparent way to retrieve the documents.

So my question is: how do I retrieve the top hit for each bucket? Or am I doing this completely wrong? Is there some other way to achieve what I am trying to do?

EDIT

After the help I received, I was able to retrieve my results with the following:

result = _elasticClient.Search<T>(s => s
                .From(skip)
                .Filter(fd => fd.Term(f => f.Language, language))
                .Size(pageSize)
                .SearchType(SearchType.Count)
                .Query(
                    q => q.Wildcard(f => f.Title, query, 2.0)
                         || q.Wildcard(f => f.Description, query)
                )
                .Aggregations(agd =>
                    agd.Terms("groupId", tagd => tagd
                        .Field("groupId")
                        .Size(0)
                        .Aggregations(tagdaggs =>
                            tagdaggs.TopHits("top_tag_hits", thagd => thagd
                                .Size(1)))
                    )
                ));

var groupIdAggregation = result.Aggs.Terms("groupId");

var topHits = groupIdAggregation.Items
    .Select(key => key.TopHits("top_tag_hits"))
    .SelectMany(topHitMetric => topHitMetric.Documents<ProductDocument>())
    .ToList();
Don't you have to add .Aggregations after closing your Terms aggregation? - Evaldas Buinauskas
@EvaldasBuinauskas I'm not sure I understand what you mean. Do you mean that my TopHits aggregation should be in its own separate .Aggregations call (after the first one)? I tried it, and now I only get one TopHits aggregation, still with only one document. - CoolMcGrrr
I mean that each Terms aggregation should have its own TopHits aggregation. You're doing that in your raw query as well. - Evaldas Buinauskas
@EvaldasBuinauskas You are correct. I moved the aggregation into the Terms aggregation and now I'm able to retrieve the results as expected. Thanks man! If you make an answer I will make sure to credit you. :) - CoolMcGrrr
By the way, you don't have to use Include("*") to include all fields. Just remove this option. And specifying .Size(0) should bring back ALL possible terms for you. I'll add this to the answer. - Evaldas Buinauskas

1 Answer

Your NEST query runs the Terms aggregation and the TopHits aggregation side by side, while your original query runs Terms first and then runs TopHits for each bucket.

You simply have to move your TopHits aggregation into the Terms aggregation in your NEST query to make it work.

This should fix it:

.Aggregations(agd =>
    agd.Terms("groupId", tagd => tagd
        .Field("groupId")
        .Size(0)
        .Aggregations(tagdaggs =>
            tagdaggs.TopHits("top_tag_hits", thagd => thagd
                .Size(1)))
    )
)

By the way, you don't have to use Include("*") to include all fields; just remove this option. Also, specifying .Size(0) should bring back ALL possible terms for you.
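
To illustrate the side-by-side problem, here is a rough sketch of the aggregation part of the request body that the original NEST query most likely serializes to. It assumes the aggregation names and sizes from the question; the exact JSON may differ between NEST versions, so treat it as an approximation rather than captured output.

```json
{
  "aggs": {
    "groupId": {
      "terms": {
        "field": "groupId",
        "size": 100000
      }
    },
    "top_tag_hits": {
      "top_hits": {
        "size": 1,
        "_source": {
          "include": ["*"]
        }
      }
    }
  }
}
```

Here top_tag_hits is a sibling of groupId, so it returns a single top hit for the whole result set rather than one per bucket, which matches the single document you were seeing. Nesting it under groupId, as in your raw query, gives one top_hits result per bucket.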