8
votes

As far as I'm aware, there isn't a way to do something like the following in Elasticsearch:

SELECT * FROM myindex
GROUP BY agg_field1, agg_field2, agg_field3 // aggregation
ORDER BY order_field1, order_field2, order_field3 // sort
LIMIT 1000, 5000 // paginate -- get page 6 of size 1000 records

Here are some related documents regarding this:

Is there a way to do the above in Elasticsearch? The one limitation we have is we will never have more than 10M records, so we (hopefully) shouldn't run into memory errors. My thinking was to do it as follows:

  • Do an aggregation query
  • Get the number of results from it
  • Split it into N segments based on the results and page size we want
  • Rerun the query with the above segments

What would be the best way to accomplish this? In your answer/suggestion, could you please post some sample code relating to how the above SQL query could be done in ES?


As an update to this question, here is a public index to test with:

# 5.6
e=Elasticsearch('https://search-testinges-fekocjpedql2f3rneuagyukvy4.us-west-1.es.amazonaws.com')
e.search('testindex')

# 6.4 (same data as above)
e = Elasticsearch('https://search-testinges6-fycj5kjd7l5uyo6npycuashch4.us-west-1.es.amazonaws.com')
e.search('testindex6')

It has 10,000 records. Feel free to test with it:

enter image description here

The query that I'm looking to do is as follows (in sql):

SELECT * FROM testindex
GROUP BY store_url, status, title
ORDER BY title ASC, status DESC
LIMIT 100 OFFSET 6000

In other words, I'm looking to sort an aggregated result (with multiple aggregations) and get an offset.

2
As of ES 6.1 there's a new aggregation called composite that might help you achieve this, but since you're running 5.6.8, it's not there, unfortunately. Moreover, grouping by title won't be possible as title is of type text and not keyword. - Val
Moreover, I don't see how it makes sense to sort by other fields than the ones you group by. I mean, just in the store_url group you're going to have several different retailer_id values and the one showing up might not necessarily represent all same stores. Or am I missing something obvious here? - Val
@Val -- thanks for the feedback, I've updated the index so that's a keyword type. I'll change the sort so it makes more sense. - user10332687
@Val I've also created a new index with the same data that's ES 6.4 if you want to try it out (see updated question). - user10332687
Thanks, but there are no docs in testindex6. I've reindexed testindex into testindex6 - Val

2 Answers

3
votes

The composite aggregation might help here as it allows you to group by multiple fields and then paginate over the results. The only thing that it doesn't let you do is to jump at a given offset, but you can do that by iterating from your client code if at all necessary.

So here is a sample query to do that:

POST testindex6/_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "store": {
              "terms": {
                "field": "store_url"
              }
            }
          },
          {
            "status": {
              "terms": {
                "field": "status",
                "order": "desc"
              }
            }
          },
          {
            "title": {
              "terms": {
                "field": "title",
                "order": "asc"
              }
            }
          }
        ]
      },
      "aggs": {
        "hits": {
          "top_hits": {
            "size": 100
          }
        }
      }
    }
  }
}

In the response you'll see and after_key structure:

  "after_key": {
    "store": "http://google.com1087",
    "status": "OK1087",
    "title": "Titanic1087"
  },

It's some kind of cursor that you need to use in your subsequent queries, like this:

{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "store": {
              "terms": {
                "field": "store_url"
              }
            }
          },
          {
            "status": {
              "terms": {
                "field": "status",
                "order": "desc"
              }
            }
          },
          {
            "title": {
              "terms": {
                "field": "title",
                "order": "asc"
              }
            }
          }
        ],
        "after": {
          "store": "http://google.com1087",
          "status": "OK1087",
          "title": "Titanic1087"
        }
      },
      "aggs": {
        "hits": {
          "top_hits": {
            "size": 100
          }
        }
      }
    }
  }
}

And it will give you the next 100 buckets. Hopefully this helps.

UPDATE:

If you want to know how many buckets in total there is going to be, the composite aggregation won't give you that number. However, since the composite aggregation is nothing else than a cartesian product of all the fields in its sources, you can get a good approximation of that total number by also returning the ]cardinality](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html) of each field used in the composite aggregation and multiplying them together.

  "aggs": {
    "my_buckets": {
      "composite": {
        ...
      }
    },
    "store_cardinality": {
      "cardinality": {
        "field": "store_url"
      }
    },
    "status_cardinality": {
      "cardinality": {
        "field": "status"
      }
    },
    "title_cardinality": {
      "cardinality": {
        "field": "title"
      }
    }
  }

We can then get the total number of buckets by multiplying the figure we get in store_cardinality, status_cardinality and title_cardinality together, or at least a good approximation thereof (it won't work well on high-cardinality fields, but pretty well on low-cardinality ones).

3
votes

Field collapsing is the answer.

Field collapsing feature is used when we want to group the hits on a specific field (as in group by agg_field).

Before Elastic 6, the way to group the fields is to use aggregation. This approach was lacking an ability to do efficient paging.

But now, with the field collapse provided out of the box by elastic, it is pretty easy.

Below is a sample query with field collapse taken from above link.

GET /twitter/_search
{
  "query": {
      "match": {
          "message": "elasticsearch"
      }
  },
  "collapse" : {
      "field" : "user", 
      "inner_hits": {
          "name": "last_tweets", 
          "size": 5, 
          "sort": [{ "date": "asc" }] 
      },
      "max_concurrent_group_searches": 4 
  },
  "sort": ["likes"]

}