
My current Elasticsearch cluster configuration has one master node and six data nodes, each on its own AWS EC2 instance.

Master: t2.large [2 vCPU, 8 GB RAM]

Each data node: r4.xlarge [4 vCPU, 30.5 GB RAM]

Number of shards = 12 [is it too low?]
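
For reference, the shard count is fixed when the index is created; here is a minimal sketch of how a 12-shard index might be created, assuming a hypothetical index name (logs-2019-01-01) and a cluster reachable on localhost:

    # Hypothetical index name and local endpoint; 1 replica per primary shard assumed.
    curl -X PUT "localhost:9200/logs-2019-01-01" -H 'Content-Type: application/json' -d'
    {
      "settings": {
        "number_of_shards": 12,
        "number_of_replicas": 1
      }
    }'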

Every day I run a bulk import of 110 GB of log data.

It is imported in three steps.

First, I create a new index and bulk import 50 GB of data.

That import runs very fast and usually completes in 50-65 minutes.

Then I run a second bulk import task of about 40 GB of data, which is actually an update of the previously imported records. [Absolutely no new records.]

That update task takes about 6 hours on average.
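
For context, the update step is a plain bulk request made of update actions; below is a minimal sketch of what one batch of that payload might look like, assuming hypothetical document IDs, field names, and the index name used above:

    # Sketch of one bulk-update batch; updates.ndjson holds alternating action and
    # partial-doc lines (hypothetical index name, IDs, and fields), for example:
    #   { "update": { "_index": "logs-2019-01-01", "_id": "1" } }
    #   { "doc": { "status": "processed" } }
    #   { "update": { "_index": "logs-2019-01-01", "_id": "2" } }
    #   { "doc": { "status": "processed" } }
    curl -X POST "localhost:9200/_bulk" -H 'Content-Type: application/x-ndjson' \
      --data-binary @updates.ndjson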

Is there any way to speed up/optimize the whole process?

Options I am considering

1- Increase the data node count from the current 6 to 10.

OR

2- Increase the memory/CPU on each data node.

OR

3- Ditch the update part altogether and import all the data into separate indices. That will require updating the query logic on the application side as well, in order to query multiple indices (see the sketch after this list), but other than that, are there any downsides to multiple indices?

OR

4- Any other option which I might have overlooked?
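
Regarding option 3, querying several indices does not require separate requests; a comma-separated list (or a wildcard pattern) in the URL searches them in one call. A minimal sketch, assuming hypothetical index and field names:

    # Search two hypothetical indices in a single request (a wildcard such as logs-* would also work).
    curl -X GET "localhost:9200/logs-2019-01-01,logs-2019-01-02/_search" -H 'Content-Type: application/json' -d'
    {
      "query": {
        "match": { "message": "some value" }
      }
    }'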

EDIT:

I went ahead with increasing the number of nodes as a test run and can confirm the following results. [Posting here in case it helps someone.]

Note: each node's specs remained the same as above

Current system: 6 nodes, 12 shards
Main insert (~50 GB) = 54 min
Update      (~40 GB) = 5 h 30 min

Test 1: 10 nodes, 24 shards
Main insert (~50 GB) = 51 min
Update      (~40 GB) = 2 h 15 min

Test 2: 12 nodes, 24 shards
Main insert (~50 GB) = 51 min
Update      (~40 GB) = 1 h 36 min

There is a huge improvement, but I am still looking for suggestions, as running that many instances is economically burdensome.


1 Answer


Increasing the number of data nodes or the memory/CPU on each data node won't solve your problem, as it won't make a significant difference in the indexing time.

An update requires Elasticsearch to first find the document and then overwrite it by indexing a new version and marking the old one as deleted, which tends to get slower the larger the shards get. Option 3, which you propose, would be a reasonable solution, but it can impact your query time, since every search has to hit two different indices. You can avoid that by introducing a field such as 'type' in the same index to distinguish the two sets of documents; that keeps the queries against the ES index simple and the fetch time low.

For example, your documents would look something like this, and the type field lets you fetch either set:

    {
      "data": "some data",
      "type": "first-inserted"
    },
    {
      "data": "some data",
      "type": "second-inserted"
    }
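
A filtered search for one of the two groups might then look like the sketch below, assuming a hypothetical index name and that 'type' is mapped as a keyword field (with the default dynamic mapping you would query 'type.keyword' instead):

    # Fetch only the documents written by the second (update) import.
    # Assumes "type" is a keyword field; with dynamic mapping use "type.keyword".
    curl -X GET "localhost:9200/my-index/_search" -H 'Content-Type: application/json' -d'
    {
      "query": {
        "term": { "type": "second-inserted" }
      }
    }'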