
I'm creating a document collection in Spark as an RDD and writing it to Elasticsearch with the Elasticsearch Spark read/write library. The cluster that creates the collection is large, so when it writes to ES I get the errors below indicating that ES is overloaded, which does not surprise me. This does not seem to fail the job; the tasks appear to be retried and eventually succeed, and in the Spark UI the job is reported as having finished successfully.

  1. Is there a way to throttle the ES writing lib to avoid the retries (I can't change the cluster size)?
  2. Do these errors mean that some data was not written to the index?

Here is one of many reported task failure errors, but again no job failure is reported:

2017-03-20 10:48:27,745 WARN  org.apache.spark.scheduler.TaskSetManager [task-result-getter-2] - Lost task 568.1 in stage 81.0 (TID 18982, ip-172-16-2-76.ec2.internal): org.apache.spark.util.TaskCompletionListenerException: Could not write all entries [41/87360] (maybe ES was overloaded?). Bailing out...
    at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:112)
    at org.apache.spark.scheduler.Task.run(Task.scala:102)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

The library I'm using is:

"org.elasticsearch" % "elasticsearch-spark_2.10" % "2.1.2"
@Aditya Roy If you are talking about es.batch.size.entries and es.batch.size.bytes, those params seem to have nothing to do with the maximum number of records that can be dumped in a single post. - pferrel

1 Answer


Can you follow this link: https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html? In your Spark conf properties or your Elasticsearch properties you need to increase the maximum number of records that can be dumped in a single POST, and that should solve your problem.
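
As a sketch only (the setting names come from the elasticsearch-hadoop configuration documentation; the values below are illustrative, not recommendations), those properties can be set on the Spark conf before writing:

    import org.apache.spark.SparkConf

    // Illustrative values; see the configuration page linked above for the full list.
    val conf = new SparkConf()
      .set("es.batch.size.entries", "5000") // max documents per bulk request (default 1000)
      .set("es.batch.size.bytes", "5mb")    // max size of a bulk request (default 1mb)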