Pardon my ignorance if this question is too broad or vague. I'm playing with Elasticsearch with out-of-the-box settings on my laptop and it works just great.

It's an 8-core MacBook with 6 GB of heap given to Elasticsearch, and it works pretty well for a large dataset (just over 7 million documents).

I'm keen to set up a multinode cluster (2 machines) and before I assume a few things, I would like to get expert views on a few key points.

I understand "How many shards per node" is a very subjective question and one answer will not fit all situations.

I understand that sharding helps to distribute the indices across multiple nodes so that the storage footprint per node is optimal. But mainly, I'd like to understand how sharding affects query speed and effective CPU core utilization.

When a single search query comes in, does ES fire internal subqueries to all the shards in parallel, so that it can keep all the cores busy (if the number of shards equals the number of cores)?

Can I also be pointed to a few useful links that will help me? Thanks.

2 Answers

1
votes

Your understanding is pretty much spot on.

The basic concept to understand is that one query on a single shard uses one thread, and one thread runs on one CPU core at a time. If the query needs to touch multiple shards, ES makes sure all the shards involved are queried, and each shard does its part of the job using one thread.
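To make the fan-out concrete, here is a minimal sketch in Python of the scatter-gather pattern described above. It is not Elasticsearch code: `search_shard`, `search_index`, the tiny in-memory "shards" and the term-count scoring are all made up for illustration. Each shard is searched by one task in a thread pool, and the partial results are merged at the end. Note that Python threads do not run CPU-bound work truly in parallel, so this only shows the structure of the work, not real multi-core speedup.

```python
from concurrent.futures import ThreadPoolExecutor
import heapq


def search_shard(shard, term, k):
    """Search one shard in a single thread; return its local top-k (score, doc_id)."""
    hits = []
    for doc_id, text in shard:
        score = text.split().count(term)  # toy relevance: term frequency
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)


def search_index(shards, term, k=3, pool_size=4):
    """Fan the query out to every shard, then merge the per-shard results."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # One task per shard: each shard's part of the query runs in one thread.
        partials = pool.map(lambda shard: search_shard(shard, term, k), shards)
        merged = [hit for part in partials for hit in part]
    return heapq.nlargest(k, merged)


# Hypothetical index split into three shards of (doc_id, text) pairs.
shards = [
    [(1, "shard sizing and shard count"), (2, "heap settings")],
    [(3, "one shard one thread"), (4, "query speed")],
    [(5, "shard shard shard"), (6, "cores and threads")],
]

top_hits = search_index(shards, "shard")
print(top_hits)
```

If `pool_size` is smaller than the number of shards, the extra shard tasks simply wait in the pool's queue, which is the same queuing effect discussed further down in this thread.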

The size of the shard and the complexity of the query determine how much time is spent in that thread. But the OS will not dedicate a CPU core to that thread the whole time: the OS schedules other jobs too, and other processes get a slice of that CPU core.

Ideally, yes, you would have number of shards = number of cores, but clusters out there rarely use this setup. It is mainly used by clusters that handle a lot of concurrent requests per second and demand a strict response time.
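If you want to experiment with that setup, the number of primary shards is chosen when the index is created, through the `index.number_of_shards` setting in the create-index request body. A minimal sketch in Python that only builds and prints that body is below. The index name `my-index`, the helper `build_index_settings` and the choice of one replica are assumptions for illustration, and `os.cpu_count()` reports the cores of the machine running the script, not of the cluster.

```python
import json
import os


def build_index_settings(num_shards, num_replicas=1):
    """Build a create-index request body with explicit shard settings."""
    return {
        "settings": {
            "index": {
                "number_of_shards": num_shards,      # primary shards
                "number_of_replicas": num_replicas,  # copies of each primary
            }
        }
    }


# Assumption for this sketch: one primary shard per local CPU core.
cores = os.cpu_count() or 1
body = build_index_settings(cores)

# The body would be sent as: PUT /my-index  (hypothetical index name)
print("PUT /my-index")
print(json.dumps(body, indent=2))
```

Treat the printed request as a starting point for testing rather than a recommendation.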

0
votes

Thanks for the response.

Just a summary of my understanding to get it validated.

Number of shards == number of cores

(-)

  • Bigger shards.
  • A thread could take more time to search a single shard, so other threads could be queued up and made to wait?

(+)

  • Optimal core utilization.
  • Less chance of context-switching overhead, as the number of threads is limited.

Number of shards > number of cores

(-)

  • More threads will be spawned for queries, and context-switching overhead may apply.
  • More threads may need more memory (for thread stacks etc.).
  • More shards could potentially need more housekeeping by Elasticsearch (i.e. managing file handles etc.).

(+)

  • A thread could take less time (relatively) to search a single shard.
  • Could process more concurrent requests (as there are more threads).

Eventually, it depends on which of these gives the best balance between available hardware, query speed and concurrency, and I think it requires quite a bit of experimenting. Or, in other words, which one hurts the least.
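To get a rough feel for this trade-off before experimenting on real hardware, here is a toy back-of-the-envelope model in Python. Everything in it is an assumption, not a measurement: search time is taken to be proportional to shard size, each shard search occupies one core, shard tasks run in "waves" of at most `cores` at a time, and the cost of 100 ms per million documents is invented. It ignores merge cost, caching, context switching and per-shard overhead, so it can only show the queuing effect, not predict real latencies.

```python
import math


def estimated_latency_ms(total_docs, shards, cores, concurrent_queries,
                         ms_per_million_docs=100.0):
    """Toy estimate of the time until all concurrent queries have finished."""
    # Time for one thread to search one shard (assumed linear in shard size).
    shard_task_ms = (total_docs / shards) / 1_000_000 * ms_per_million_docs
    # Every query produces one task per shard.
    tasks = shards * concurrent_queries
    # At most `cores` tasks run at once; the rest wait for the next wave.
    waves = math.ceil(tasks / cores)
    return waves * shard_task_ms


TOTAL_DOCS = 7_000_000  # dataset size from the question
CORES = 8               # cores from the question

for shards in (1, 4, 8, 16, 32):
    for queries in (1, 8):
        latency = estimated_latency_ms(TOTAL_DOCS, shards, CORES, queries)
        print(f"shards={shards:>2} concurrent_queries={queries} "
              f"estimated_latency_ms={latency:.1f}")
```

In this simplified model, going beyond one shard per core does not shorten a single query, because the extra shard tasks just wait for a free core. Real clusters add overheads the model leaves out, which is why measuring with your own data and queries is still needed.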