1
votes

I've been trying to set up an Elasticsearch cluster to process log data from some 3D printers. We have more than 850K documents generated each day across 20 machines, and each machine has its own index.

Right now we have 16 months of data, which makes about 410M records to index across the Elasticsearch indices. We are processing the data from CSV files with Spark and writing to an Elasticsearch cluster of 3 machines, each with 16GB of RAM and 16 CPU cores. But each time we reach about 10-14M docs per index, we get a network error.

Job aborted due to stage failure: Task 173 in stage 9.0 failed 4 times, most recent failure: Lost task 173.3 in stage 9.0 (TID 17160, wn21-xxxxxxx.ax.internal.cloudapp.net, executor 3): org.elasticsearch.hadoop.rest.EsHadoopNoNodesLeftException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[X.X.X.X:9200]]

I'm sure this is not a network error; it's just that Elasticsearch cannot handle more indexing requests.

To solve this, I've tried to tweak many Elasticsearch parameters, such as refresh_interval, to speed up indexing and get rid of the error, but nothing worked. After monitoring the cluster, we think we should scale it up.
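
(For reference, the kind of refresh_interval tweak I mean looks roughly like the sketch below, using the Python Elasticsearch client with a placeholder host and index name; the exact call differs slightly between client versions.)

    # Sketch: relax the refresh interval during the bulk load (placeholder host/index).
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://X.X.X.X:9200"])

    # Raise refresh_interval (or set it to -1) while Spark is writing,
    # then restore the default once indexing is done.
    es.indices.put_settings(index="printer_01", body={"index": {"refresh_interval": "60s"}})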

We also tried to tune the Elasticsearch Spark connector, but with no result.
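
Connector-side tuning here mostly means shrinking the bulk requests and adding write retries. A rough PySpark sketch, where the host, CSV path, index name and option values are placeholders rather than the exact settings we used:

    # Sketch: write the CSV data with smaller bulk requests and retries (placeholder names).
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("printer-logs").getOrCreate()
    df = spark.read.csv("/data/printer_logs/*.csv", header=True, inferSchema=True)

    (df.write
       .format("org.elasticsearch.spark.sql")
       .option("es.nodes", "X.X.X.X")
       .option("es.port", "9200")
       .option("es.batch.size.entries", "1000")     # documents per bulk request
       .option("es.batch.size.bytes", "1mb")        # bytes per bulk request
       .option("es.batch.write.retry.count", "6")   # retry rejected bulk requests
       .option("es.batch.write.retry.wait", "30s")  # back off between retries
       .mode("append")
       .save("printer_01/doc"))                     # "index/type" resource; adjust for your ES version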

So I'm looking for the right way to choose the cluster size. Are there any guidelines on how to choose your cluster size? Any hints will be helpful.

NB: we are mainly interested in indexing, since we have only one or two queries to run on the data to get some metrics.


2 Answers

3
votes

I would start by trying to split up the indices by month (or even day) and then search across an index pattern. Example: sensor_data_2018.01, sensor_data_2018.02, sensor_data_2018.03 etc. And search with an index pattern of sensor_data_*
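
If the data is written through the Spark connector, one way to get that monthly layout (as far as I remember from the es-hadoop docs) is its dynamic resource naming, where the index name is derived from a timestamp field. A hedged sketch, with placeholder host, path and field names:

    # Sketch: route each document to a monthly index derived from a 'timestamp' column.
    # Assumes such a column exists and the connector can format it; names are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("printer-logs").getOrCreate()
    df = spark.read.csv("/data/printer_logs/*.csv", header=True, inferSchema=True)

    (df.write
       .format("org.elasticsearch.spark.sql")
       .option("es.nodes", "X.X.X.X")
       .mode("append")
       .save("sensor_data_{timestamp|yyyy.MM}"))  # e.g. sensor_data_2018.01, sensor_data_2018.02, ...

Searches can then target the whole pattern (sensor_data_*) or a narrower one such as sensor_data_2018.*.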

Some things that will impact what cluster size you need are:

  • How many documents
  • Average size of each document
  • How many messages/second are being indexed
  • Disk IO speed

I think your cluster should be good enough to handle that amount of data. We have a cluster with 3 nodes (8 CPU / 61GB RAM each), ~670 indices, ~3 billion documents, ~3TB of data, and we have only had indexing problems when the indexing rate exceeds 30,000 documents/second. Even then, only the indexing of a few documents fails, and it can be successfully retried after a short delay. Our implementation is also very indexing-heavy, with minimal actual searching.

I would check the Elasticsearch server logs and see if you can find a more detailed error message; in particular, look for RejectedExecutionExceptions. Also check the cluster health and node stats when you start to receive the failures, which might shed some more light on what's occurring. If possible, implement a retry with backoff when failures start to occur, to give ES time to catch up with the load.
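
As a rough illustration of that kind of check (Python client; the host is a placeholder and the exact API calls vary a little between client versions):

    # Sketch: watch cluster health and thread-pool rejections while the job runs.
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://X.X.X.X:9200"])

    # Overall health: status, relocating/initializing shards, pending tasks.
    print(es.cluster.health())

    # Rejections in the write/bulk thread pools mean ES is pushing back on the indexing load.
    print(es.cat.thread_pool(v=True, h="node_name,name,active,queue,rejected"))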

Hope that helps a bit!

1
votes

This is a network error, saying the data node is ... lost. Maybe a crash; you can check the Elasticsearch logs to see what's going on.

The most important thing to understand with elasticsearch-hadoop is how the work is parallelized:

  • 1 Spark partition per 1 Elasticsearch shard

The important thing is sharding: that is how you load-balance the work across Elasticsearch. Also, refresh_interval should be > 30 seconds, and you should disable replication while indexing; this is very basic configuration tuning, and I'm sure you can find plenty of advice about it in the documentation.
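
A minimal sketch of that kind of index setup (Python client; the names and values are placeholders to adapt, and the exact call differs a bit between client versions):

    # Sketch: create the index with an explicit shard count, no replicas and a relaxed
    # refresh interval for the initial bulk load (placeholder names and values).
    from elasticsearch import Elasticsearch

    es = Elasticsearch(["http://X.X.X.X:9200"])

    es.indices.create(
        index="sensor_data_2018.01",
        body={
            "settings": {
                "number_of_shards": 3,     # spread the indexing work across the data nodes
                "number_of_replicas": 0,   # re-enable replicas once the bulk load is done
                "refresh_interval": "60s", # comfortably above the 30 seconds suggested here
            }
        },
    )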

With Spark, you can check in the web UI (port 4040) how the work is split into tasks and partitions, which helps a lot. You can also monitor the network bandwidth between Spark and ES, and the ES node stats.