TL;DR
My ElasticSearch query takes forever compared to the same query on SQL Server.
Am I doing something wrong? Is there any way to boost my query's performance?
Is it just one of those things an RDBMS does better than NoSQL?
Premise
Let's say I have a business that takes orders and delivers requested items.
- I would like to know the average number of unique items per order.
- My orders data is arranged per item ordered - each order has one or more records containing an Order ID, an Item ID and more (see the small illustration after this list).
- I have a one-node setup for development purposes
- The results (performance-wise) are the same whether I have 4 GB heap space (on a 12 GB machine) or 16 GB heap space (on a 32 GB machine)
- The index has billions of records, but the query filters it to about 300,000 records
- The Order and Item IDs are of type keyword (textual by nature) and I have no way to change that.
- In this particular case, the average unique item count is 1.65 - many orders contain only one unique item, others contain 2, and a few contain up to 25 unique items.
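To illustrate the per-item layout and the metric (the rows below are made up for illustration; only the field names match the real mapping):

orderID    itemID
904044A    23KN
904044A    77AB
912815A    23KN

Here order 904044A has two unique items and order 912815A has one, so the average unique item count over these rows would be 1.5.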
The Problem
Using ElasticSearch, I would have to use a Terms aggregation to group documents by order ID, a Cardinality aggregation to get the unique item count per order, and an Average Bucket aggregation to average those counts across orders.
This takes about 23 seconds on both of my setups. The same query takes less than 2 seconds on SQL Server with the same dataset.
Additional Information
ElasticSearch Query
{
    "size": 0,
    "query": {
        "bool": {
            "filter": [
                {
                    ...
                }
            ]
        }
    },
    "aggs": {
        "OrdersBucket": {
            "terms": {
                "field": "orderID",
                "execution_hint": "global_ordinals_hash",
                "size": 10000000
            },
            "aggs": {
                "UniqueItems": {
                    "cardinality": {
                        "field": "itemID"
                    }
                }
            }
        },
        "AverageItemCount": {
            "avg_bucket": {
                "buckets_path": "OrdersBucket>UniqueItems"
            }
        }
    }
}
At first my query generated an OutOfMemoryException, which brought my server down.
Issuing the same request on my higher-RAM setup yielded the following circuit breaker exception:
[request] Data too large, data for [<reused_arrays>] would be [14383258184/13.3gb], which is larger than the limit of [10287002419/9.5gb]
The ElasticSearch GitHub repository has several (currently) open issues on this matter:
Cardinality aggregation should not reserve a fixed amount of memory per bucket #15892
Heap Explosion on even small cardinality queries in ES 5.3.1 / Kibana 5.3.1 #24359
All of which led me to use the execution hint "global_ordinals_hash", which allowed the query to complete successfully (albeit taking its time...)
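For reference, the request circuit breaker limit can be raised dynamically via cluster settings (the 70% below is just an example value), but that only buys headroom rather than reducing what the aggregation allocates:

PUT _cluster/settings
{
    "persistent": {
        "indices.breaker.request.limit": "70%"
    }
}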
Analogous SQL Query
SELECT AVG(CAST(uniqueCount.amount AS FLOAT))
FROM
(
    SELECT o.OrderID, COUNT(DISTINCT o.ItemID) AS amount
    FROM Orders o
    WHERE ...
    GROUP BY o.OrderID
) uniqueCount
And this, as I've said, is very very fast.
orderID field mapping
{
    "orderID": {
        "full_name": "orderID",
        "mapping": {
            "orderID": {
                "type": "keyword",
                "boost": 1,
                "index": true,
                "store": false,
                "doc_values": true,
                "term_vector": "no",
                "norms": false,
                "index_options": "docs",
                "eager_global_ordinals": true,
                "similarity": "BM25",
                "fields": {
                    "autocomplete": {
                        "type": "text",
                        "boost": 1,
                        "index": true,
                        "store": false,
                        "doc_values": false,
                        "term_vector": "no",
                        "norms": true,
                        "index_options": "positions",
                        "eager_global_ordinals": false,
                        "similarity": "BM25",
                        "analyzer": "autocomplete",
                        "search_analyzer": "standard",
                        "search_quote_analyzer": "standard",
                        "include_in_all": true,
                        "position_increment_gap": -1,
                        "fielddata": false
                    }
                },
                "null_value": null,
                "include_in_all": true,
                "ignore_above": 2147483647,
                "normalizer": null
            }
        }
    }
}
I have set eager_global_ordinals on the field in an attempt to boost performance, but to no avail.
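For reference, enabling it amounts to a mapping update on the existing field; a sketch, using the index and type names from the sample document below:

PUT 81cec0acbca6423aa3c2feed5dbccd98/_mapping/order
{
    "properties": {
        "orderID": {
            "type": "keyword",
            "eager_global_ordinals": true
        }
    }
}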
Sample Document
{
"_index": "81cec0acbca6423aa3c2feed5dbccd98",
"_type": "order",
"_id": "AVwpLZ7GK9DJVcpvrzss",
"_score": 0,
"_source": {
...
"orderID": "904044A",
"itemID": "23KN",
...
}
}
Irrelevant and undisclosable fields have been removed for brevity.
Sample Output
{
    "OrdersBucket": {
        "doc_count_error_upper_bound": 0,
        "sum_other_doc_count": 0,
        "buckets": [
            {
                "key": "910117A",
                "doc_count": 16,
                "UniqueItems": {
                    "value": 16
                }
            },
            {
                "key": "910966A",
                "doc_count": 16,
                "UniqueItems": {
                    "value": 16
                }
            },
            ...
            {
                "key": "912815A",
                "doc_count": 1,
                "UniqueItems": {
                    "value": 1
                }
            },
            {
                "key": "912816A",
                "doc_count": 1,
                "UniqueItems": {
                    "value": 1
                }
            }
        ]
    },
    "AverageItemCount": {
        "value": 1.3975020363833832
    }
}
Any help will be much appreciated :)