2
votes

TL;DR

My Elasticsearch query takes forever compared to the same query on SQL Server.
Am I doing something wrong? Is there any way to boost my query's performance?
Is it just one of those things an RDBMS does better than NoSQL?

Premise

Let's say I have a business that takes orders and delivers requested items.

  • I would like to know the average number of unique items per order.
  • My orders data is arranged per item ordered - each order has one or more records containing Order ID, Item ID and more.
  • I have a one-node setup for development purposes.
  • The results (performance-wise) are the same whether I have 4 GB of heap space (on a 12 GB machine) or 16 GB of heap space (on a 32 GB machine).
  • The index has billions of records, but the query filters it down to about 300,000 records.
  • The Order and Item IDs are of type keyword (textual by nature) and I have no way to change that.
  • In this particular case, the average unique item count is 1.65 - many orders contain only one unique item, others contain 2, and a few contain up to 25 unique items.
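For concreteness, the metric described above can be sketched in plain Python over per-item order records (the sample data below is made up for illustration):

```python
from collections import defaultdict

# Each record is one ordered item: (orderID, itemID) - hypothetical sample data
records = [
    ("904044A", "23KN"),
    ("904044A", "23KN"),  # duplicate item on the same order, counted once
    ("904044A", "77XQ"),
    ("910117A", "23KN"),
]

# Group distinct item IDs per order
items_per_order = defaultdict(set)
for order_id, item_id in records:
    items_per_order[order_id].add(item_id)

# Average number of unique items per order
avg = sum(len(items) for items in items_per_order.values()) / len(items_per_order)
print(avg)  # (2 + 1) / 2 = 1.5
```

This is exactly what the terms + cardinality + avg_bucket aggregation pipeline below computes, and what the SQL query with COUNT(DISTINCT ...) computes.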

The Problem

Using Elasticsearch, I would have to use a Terms aggregation to group documents by order ID, a Cardinality aggregation to get the unique item count per order, and an Average Bucket pipeline aggregation to get the average item count per order.

This takes about 23 seconds on both of my setups. The same query takes less than 2 seconds on the same dataset in SQL Server.

Additional Information

Elasticsearch Query

{
   "size":0,
   "query":{
      "bool":{
         "filter":[
            {
               ...
            }
         ]
      }
   },
   "aggs":{
      "OrdersBucket":{
         "terms":{
            "field":"orderID",
            "execution_hint":"global_ordinals_hash",
            "size":10000000
         },
         "aggs":{
            "UniqueItems":{
               "cardinality":{
                  "field":"itemID"
               }
            }
         }
      },
      "AverageItemCount":{
         "avg_bucket":{
            "buckets_path":"OrdersBucket>UniqueItems"
         }
      }
   }
}

At first my query threw an OutOfMemoryException, which brought my server down.
Issuing the same request on my higher-RAM setup tripped the following circuit breaker:

[request] Data too large, data for [<reused_arrays>] would be
[14383258184/13.3gb], which is larger than the limit of
[10287002419/9.5gb]

The Elasticsearch GitHub repository has several open issues on this matter (at the time of writing):

Cardinality aggregation should not reserve a fixed amount of memory per bucket #15892

global_ordinals execution mode for the terms aggregation has an adversarially impact on children aggregations that expect dense buckets #24788

Heap Explosion on even small cardinality queries in ES 5.3.1 / Kibana 5.3.1 #24359

All of which led me to use the execution hint "global_ordinals_hash", which allowed the query to complete successfully (albeit taking its time...).

Analogous SQL Query

SELECT AVG(CAST(uniqueCount.amount AS FLOAT)) FROM 
(   SELECT o.OrderID, COUNT(DISTINCT o.ItemID) AS amount 
    FROM Orders o
    WHERE ...
    GROUP BY o.OrderID 
) uniqueCount

And this, as I've said, is very, very fast.

orderID field mapping

{
   "orderID":{
      "full_name":"orderID",
      "mapping":{
         "orderID":{
            "type":"keyword",
            "boost":1,
            "index":true,
            "store":false,
            "doc_values":true,
            "term_vector":"no",
            "norms":false,
            "index_options":"docs",
            "eager_global_ordinals":true,
            "similarity":"BM25",
            "fields":{
               "autocomplete":{
                  "type":"text",
                  "boost":1,
                  "index":true,
                  "store":false,
                  "doc_values":false,
                  "term_vector":"no",
                  "norms":true,
                  "index_options":"positions",
                  "eager_global_ordinals":false,
                  "similarity":"BM25",
                  "analyzer":"autocomplete",
                  "search_analyzer":"standard",
                  "search_quote_analyzer":"standard",
                  "include_in_all":true,
                  "position_increment_gap":-1,
                  "fielddata":false
               }
            },
            "null_value":null,
            "include_in_all":true,
            "ignore_above":2147483647,
            "normalizer":null
         }
      }
   }
}

I have set eager_global_ordinals in an attempt to boost performance, but to no avail.

Sample Document

{
            "_index": "81cec0acbca6423aa3c2feed5dbccd98",
            "_type": "order",
            "_id": "AVwpLZ7GK9DJVcpvrzss",
            "_score": 0,
            "_source": {
        ...
               "orderID": "904044A",
               "itemID": "23KN",
        ...
            }
}

Irrelevant fields have been removed for brevity and confidentiality.

Sample Output

{
   "OrdersBucket":{
      "doc_count_error_upper_bound":0,
      "sum_other_doc_count":0,
      "buckets":[
         {
            "key":"910117A",
            "doc_count":16,
            "UniqueItems":{
               "value":16
            }
         },
         {
            "key":"910966A",
            "doc_count":16,
            "UniqueItems":{
               "value":16
            }
         },
        ...
         {
            "key":"912815A",
            "doc_count":1,
            "UniqueItems":{
               "value":1
            }
         },
         {
            "key":"912816A",
            "doc_count":1,
            "UniqueItems":{
               "value":1
            }
         }
      ]
   },
   "AverageItemCount":{
      "value":1.3975020363833832
   }
}

Any help will be much appreciated :)

1
Can you please share a sample document and a sample output? That would be very helpful. - Richa
I edited those into the question, though I don't see how they help much (except making the question more TL;DR-worthy ;) ). - Jony Adamit
Is it possible for you to change the structure of your index? It is generally advisable to index your data in a way that is easy to query, and since ES is a NoSQL database, it is better to keep data in a denormalized form. - Richa
Hi @Richa, this is actually denormalized form... how would you denormalize it further? Also, this question can easily be generalized: this pattern of a cardinality aggregation inside a terms aggregation, followed by a pipeline aggregation, isn't specific to my structure (which, as far as I can see, is as denormalized as it gets). I am currently testing the effect of making that field numeric, but the question remains: is this the sort of thing only an RDBMS can do, or are there measures I can take to keep using ES for this kind of work? - Jony Adamit
The fields removed for brevity include date, client ID, and such. In a normalized form they would exist at the order level only, not at the item level. - Jony Adamit

1 Answer

1
votes

Apparently SQL Server does a good job of caching those results.
Further investigation showed that the initial query takes about the same time as it does on Elasticsearch.

I will look into why those results are not cached as effectively by Elasticsearch.

I also managed to convert the order ID to an integer, which dramatically improved performance (though it yielded the same performance gain on SQL Server as well).
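Assuming the order IDs can indeed be parsed as integers, the reindexed field mapping would look roughly like this (only the type change is shown; everything else is an assumption about the rest of the mapping staying as-is):

```json
{
   "mappings":{
      "order":{
         "properties":{
            "orderID":{
               "type":"long"
            }
         }
      }
   }
}
```

Numeric doc values avoid the global-ordinals machinery that keyword fields need for the terms aggregation, which is a plausible reason for the speedup.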

Also, as advised by Mark Harwood on the Elastic forum, specifying precision_threshold on the cardinality aggregation lowered memory consumption considerably!
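For reference, precision_threshold is set directly on the cardinality aggregation; the value below is an illustrative assumption (well above the per-order unique item counts in this dataset), not a recommendation:

```json
"UniqueItems":{
   "cardinality":{
      "field":"itemID",
      "precision_threshold":100
   }
}
```

Below the threshold, counts are close to exact; lowering it trades accuracy on high-cardinality buckets for less memory per bucket, which is irrelevant here since no order exceeds 25 unique items.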

So the answer is that, for this particular kind of query, Elasticsearch performs at least as well as SQL Server.