
I'm running a Docker setup with ElasticSearch, Logstash, Filebeat and Kibana inspired by the Elastic Docker Compose. I need to do an initial load of 15 GB of log files into the system (Filebeat->Logstash->ElasticSearch), but I'm having some issues with performance.

It seems that Filebeat/Logstash is producing more work than ElasticSearch can keep up with. After some time I begin to see a bunch of log messages like this in ElasticSearch:

[INFO ][o.e.i.IndexingMemoryController] [f8kc50d] now throttling indexing for shard [log-2017.06.30]: segment writing can't keep up

I've found this old documentation article on how to disable merge throttling: https://www.elastic.co/guide/en/elasticsearch/guide/master/indexing-performance.html#segments-and-merging.

PUT /_cluster/settings
{
    "transient" : {
        "indices.store.throttle.type" : "none" 
    }
}

But in the current version (ElasticSearch 6) it gives me this error:

{
  "error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "transient setting [indices.store.throttle.type], not dynamically updateable"
      }
    ],
    "type": "illegal_argument_exception",
    "reason": "transient setting [indices.store.throttle.type], not dynamically updateable"
  },
"status": 400
}

How can I solve the above issue?

The VM has 4 CPU cores (Intel Xeon E5-2650) and ElasticSearch is assigned 4GB of RAM, Logstash and Kibana 1GB each. Swapping is disabled using "swapoff -a". X-Pack and monitoring are enabled. I only have one ES node for this log server. Do I need multiple nodes for this initial bulk import?

EDIT1:

Changing the number_of_replicas and refresh_interval seems to make it perform better. Still testing.

PUT /log-*/_settings
{
    "index.number_of_replicas" : "0",
    "index.refresh_interval" : "-1"
}
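
Once the import completes I plan to put the settings back, roughly like this (a sketch, assuming the stock defaults of 1 replica and a 1-second refresh interval):

PUT /log-*/_settings
{
    "index.number_of_replicas" : "1",
    "index.refresh_interval" : "1s"
}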
What are your cluster stats (number of nodes, shards, replicas, what kind of hardware)? Do you have any other stats, like iostat, JVM stats, etc.? Have you changed any other settings? – Egor
@Egor Thanks for your comment. I've updated the question with additional information. – dhrm
You could perhaps reduce the number of worker threads in Logstash (the -w option at startup). For Elasticsearch, I remember that you're supposed to give it half the available RAM and leave the rest for the filesystem ("Set Xmx to no more than 50% of your physical RAM", from elastic.co/guide/en/elasticsearch/reference/master/…). – baudsp

1 Answer


Most likely the bottleneck is IO (you can confirm this by running iostat; it would also be useful if you posted ES monitoring screenshots), so you need to reduce the pressure on it.
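
If shell access is awkward, a rough complement to iostat is the node stats API, which exposes filesystem and JVM counters (fs and jvm are standard node-stats metric filters):

GET /_nodes/stats/fs,jvm?human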

The default ES configuration causes many index segments to be generated during a bulk load. To fix this, increase index.refresh_interval (or set it to -1) for the duration of the bulk load - see the docs. The default value is 1 second, which causes a new segment to be created every second. Also try to increase the batch size and see if it helps.
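
To see how many segments the load is actually generating, the cat segments API gives a quick per-index overview; a sketch, assuming the log-* index pattern from your question:

GET /_cat/segments/log-*?v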

Also, if you use spinning disks, set index.merge.scheduler.max_thread_count to 1. This allows only one thread to perform segment merging and reduces the contention for IO between segment merging and indexing.
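
A sketch of applying that to the log indices via the update index settings API; if your Elasticsearch version rejects it as a non-dynamic setting, set it in the index template or at index creation instead:

PUT /log-*/_settings
{
    "index.merge.scheduler.max_thread_count" : "1"
}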