0 votes

I currently have a basic Elasticsearch cluster in which I use a river to index data. I want to scale for future growth in two phases; the number of documents indexed per second is the likely bottleneck.

  1. Phase 1: Indexing 100 documents per second into elasticsearch
  2. Phase 2: Indexing 10000 documents per second into elasticsearch

How should I go about it?

Thanks in advance!

Edit:
I am trying to index the Twitter stream. Each document is around 2 KB. Hardware is flexible: right now I have magnetic disks (with 50 GB of RAM), but getting SSDs (and a better configuration) is no problem.

I'd use the bulk API for that purpose, but if you don't give us more information about your data size, your hardware, and what you are trying to achieve, we won't be able to help you! – eliasah
@eliasah thanks. I've edited my question with the details. – huhahihi
Are you using Logstash, a river, or another solution? – eliasah
Yes, I am using the Elasticsearch Twitter river at the moment. But if it can't keep up in the future, I am fine with writing my own code to stream and index the tweets. – huhahihi
First, rivers are deprecated and will be removed in future versions. Second, Logstash is more flexible than rivers. For example, you might want to perform extra preprocessing on the input; rivers don't allow that, unlike Logstash. – eliasah

1 Answer

1 vote

A few highlights that come from experiments and articles:
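One of the techniques raised in the comments is the bulk API: batching many documents into a single `_bulk` request is the standard way to push indexing throughput well past what per-document requests allow. Below is a minimal sketch, using only the Python standard library, of building the NDJSON body the `_bulk` endpoint expects; the index name `tweets`, the document shape, and the `localhost:9200` endpoint are assumptions for illustration, not part of the question.

```python
import json


def to_bulk_ndjson(docs, index="tweets"):
    """Build an NDJSON body for Elasticsearch's /_bulk endpoint.

    Each (id, source) pair becomes two lines: an action/metadata line
    and a source line. The body must end with a trailing newline.
    """
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    return "\n".join(lines) + "\n"


# Hypothetical tweets; in practice these would come from the Twitter stream.
tweets = [
    ("1", {"user": "alice", "text": "hello"}),
    ("2", {"user": "bob", "text": "world"}),
]
body = to_bulk_ndjson(tweets)
# POST this body to http://localhost:9200/_bulk with the header
# Content-Type: application/x-ndjson (via urllib, requests, or curl).
```

In a real pipeline you would tune the batch size (a few MB per request is a common starting point) and send batches from several concurrent workers; the official Python client's `helpers.bulk` wraps this same format for you.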

Have fun!