3 votes

What is the best way to deal with a surge in log messages being written to an ElasticSearch cluster in a standard ELK setup?

We use a standard ELK (ElasticSearch/Logstash/Kibana) set-up in AWS for our website's logging needs.

We have an autoscaling group of Logstash instances behind a load balancer, which send their output to an autoscaling group of ElasticSearch instances behind another load balancer. A single instance serves Kibana.

For day-to-day operation we run 2 Logstash instances and 2 ElasticSearch instances.

Our website experiences short periods of very high traffic during events: traffic increases by about 2000%. We know about these events well in advance.

Currently we just increase the number of ElasticSearch instances temporarily during the event. However, we have had issues where we scaled back down too quickly afterwards, losing shards and corrupting our indexes.

I've been thinking of setting auto_expand_replicas to "1-all" so that each node has a copy of all the data and we don't need to worry about how quickly we scale up or down. How significant would the overhead of transferring all the data to new nodes be? We currently only keep about 2 weeks of log data, which works out to around 50 GB in total.
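
Something like the following is what I had in mind (the logstash-* index pattern and the endpoint are placeholders for our actual setup):

curl -XPUT 'http://our-es-endpoint:9200/logstash-*/_settings' -d '
{
  "index": { "auto_expand_replicas": "1-all" }
}'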

I've also seen people mention using a separate autoscaling group of non-data nodes to deal with increases in search traffic, while keeping the number of data nodes the same. Would this help in a write-heavy situation, such as the events I mentioned?

Any particular reason why you use temporary autoscaling of ES? The number of nodes you need in a cluster mostly depends on how much data you want to store and how you query that data, not on the inbound message rate. – Magnus Bäck
I was under the (perhaps naive) impression that increasing the number of nodes in an ES cluster would let it handle a higher rate of inbound messages. – bScutt
That's probably true, but you've discovered yourself that it isn't very practical. I think there are better ways of dealing with spikes, like having an intermediate buffer (e.g. a message broker) that Logstash can pull messages from. This obviously requires that you don't have hard requirements on the availability of log messages. – Magnus Bäck
I did consider some kind of buffer for Logstash to pull messages from, but being able to view logs in real time during these events is extremely valuable for us, so I was hoping to avoid that by scaling ElasticSearch... – bScutt
You don't describe any of the resources available (RAM, specifically), any debugging or tuning you've already done, etc. You could be going from 1 EPS to 20 EPS for all we know. – Alain Collins

1 Answer

5 votes

My Advice

Your best bet is to use Redis as a broker between Logstash and Elasticsearch:

[Architecture diagram of the proposed solution]

This is described in some old Logstash docs but is still pretty relevant.

Yes, you will see a slight delay between the logs being produced and them eventually landing in Elasticsearch, but it should be minimal, since the latency between Redis and Logstash is relatively small. In my experience Logstash tends to work through a backlog in Redis pretty quickly.

This kind of setup is also more robust: even if the Logstash indexers go down, Redis keeps accepting events rather than losing them.
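
A minimal sketch of what the broker looks like in the Logstash configuration, assuming a Redis host reachable at a placeholder address (exact option names vary slightly between Logstash versions):

# Shipper side: push events onto a Redis list instead of straight to Elasticsearch
output {
  redis {
    host      => "redis.internal.example"   # placeholder broker address
    data_type => "list"
    key       => "logstash"
  }
}

# Indexer side: pull from the same list at its own pace and write to Elasticsearch
input {
  redis {
    host      => "redis.internal.example"
    data_type => "list"
    key       => "logstash"
  }
}
output {
  elasticsearch {
    hosts => ["es-lb.internal.example:9200"]  # placeholder ES endpoint
  }
}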

Just scaling Elasticsearch

As to your question on whether extra non-data nodes will help in write-heavy periods: I don't believe so, no. Non-data nodes are great when you're handling lots of searches (reads), since they fan the search out to all the data nodes and then aggregate the results before sending them back to the client, taking that aggregation load off the data nodes.

Writes will always involve your data nodes.
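
For reference, a "non-data" (client) node in the Elasticsearch versions this kind of setup usually runs on is just a node with the master and data roles switched off in elasticsearch.yml (newer releases express this through node roles instead, so treat this as a sketch):

# elasticsearch.yml for a client node: routes and aggregates, holds no shards
node.master: false
node.data: false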

I don't think adding and removing nodes is a great way to cater for this.

You can try to tweak the thread pools and queues in your peak periods. Let's say normally you have the following:

threadpool:
  index:
    type: fixed
    size: 30
    queue_size: 1000
  search:
    type: fixed
    size: 30
    queue_size: 1000

So you have an equal number of search and index threads available. Just before your peak time, you can change these settings (on the running cluster) to something like the following:

threadpool:
  index:
    type: fixed
    size: 50
    queue_size: 2000
  search:
    type: fixed
    size: 10
    queue_size: 500

Now you have a lot more threads doing indexing, allowing for faster indexing throughput, while search is put on the back burner. For good measure I've also increased the index queue_size so that more of a backlog can build up. This might not work as expected, though, so experimentation and tweaking are recommended.
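
In Elasticsearch versions where thread pool settings are still dynamic (pre-2.0), the switch can be made on a running cluster through the cluster settings API, roughly like this (endpoint is a placeholder; on newer versions these settings are static and need an elasticsearch.yml change plus a rolling restart):

curl -XPUT 'http://our-es-endpoint:9200/_cluster/settings' -d '
{
  "transient": {
    "threadpool.index.size": 50,
    "threadpool.index.queue_size": 2000,
    "threadpool.search.size": 10,
    "threadpool.search.queue_size": 500
  }
}'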