My current Elasticsearch cluster has one master node and six data nodes, each on its own AWS EC2 instance.
Master: t2.large [2 vCPU, 8GB RAM]
Each data node: r4.xlarge [4 vCPU, 30.5GB RAM]
Number of shards = 12 [is it too low?]
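For reference, this is roughly how the daily index gets created with that shard count. It's only a minimal sketch using the pre-8.x Python elasticsearch client; the host, index name, and replica count are placeholders/assumptions, not my real settings:

```python
from elasticsearch import Elasticsearch

# Placeholder host; in reality the client points at the cluster nodes.
es = Elasticsearch(["http://data-node-1:9200"])

# Create the daily index with the shard layout described above:
# 12 primary shards spread over 6 data nodes (2 primaries per node).
es.indices.create(
    index="logs-2017-01-01",  # placeholder index name
    body={
        "settings": {
            "number_of_shards": 12,
            "number_of_replicas": 1,  # assumption; replica count not discussed above
        }
    },
)
```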
Every day I run a bulk import of 110GB of log data.
It is imported in three steps.
First, I create a new index and bulk import 50GB of data.
That import runs very fast and usually completes in 50-65 minutes.
Then I run a second bulk task of about 40GB, which is purely an update of the previously imported records. [Absolutely no new records.]
That update task takes about 6 hours on average.
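For context, that update pass is essentially a stream of bulk "update" actions against the existing documents. A minimal sketch using the Python client's bulk helper; the host, index name, document IDs, and field names are placeholders:

```python
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://data-node-1:9200"])  # placeholder host

def update_actions(records, index):
    """Yield bulk 'update' actions; _id and field names here are placeholders."""
    for rec in records:
        yield {
            "_op_type": "update",   # partial update, no new documents created
            "_index": index,
            "_id": rec["id"],
            "doc": {"status": rec["status"]},
        }

# A couple of in-memory records standing in for the ~40GB update file.
records = [{"id": "1", "status": "done"}, {"id": "2", "status": "failed"}]
helpers.bulk(
    es,
    update_actions(records, "logs-2017-01-01"),  # placeholder index name
    chunk_size=2000,       # larger chunks tend to help big update passes
    request_timeout=120,
)
```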
Is there any way to speed up or optimize the whole process?
Options I am considering:
1- Increase the data node count from the current 6 to 10.
OR
2- Increase the memory/CPU on each data node.
OR
3- Ditch the update part altogether and import all the data into separate indices. That would also require updating the query logic on the application side in order to query across multiple indices (see the sketch after this list), but other than that, are there any demerits to multiple indices?
OR
4- Any other option which I might have overlooked?
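To illustrate what option 3 would mean on the query side, the application would search several indices in one request instead of a single index, either by listing them or with a wildcard pattern. A sketch with the Python client; the host, index names, and field are placeholders:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://data-node-1:9200"])  # placeholder host

# One search request spanning the base index and the separate "update" index.
resp = es.search(
    index="logs-2017-01-01,logs-2017-01-01-updates",     # placeholder names
    body={"query": {"term": {"request_id": "abc123"}}},  # placeholder field/value
    size=10,
)
for hit in resp["hits"]["hits"]:
    print(hit["_index"], hit["_source"])
```

Elasticsearch merges and scores hits across all listed indices in a single response, so the main application change is the index list passed to each query.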
EDIT:
I went ahead and increased the number of nodes as a test run, and I can confirm the following results. [Posting here just in case it helps someone.]
Note: each node's specs remained the same as above.
Current system: 6 nodes, 12 shards
Main insert (~50GB) = 54 min
Update (~40GB) = 5h 30min

Test 1: 10 nodes, 24 shards
Main insert (~50GB) = 51 min
Update (~40GB) = 2h 15min

Test 2: 12 nodes, 24 shards
Main insert (~50GB) = 51 min
Update (~40GB) = 1h 36min
There is a huge improvement, but I am still looking for suggestions, as running that many instances is economically burdensome.