Spark stateful streaming job hangs at checkpointing to S3 after long uptime

Question

I've been recently stress testing our Spark Streaming app. The stress testing ingests about 20,000 messages/sec with message sizes varying between 200bytes - 1K into Kafka, where Spark Streaming is reading batches every 4 seconds.

Our Spark cluster runs on version 1.6.1 with Standalone cluster manager, and we're using Scala 2.10.6 for our code.

After about a 15-20 hour run, one of the executors which is initiating a checkpoint (done at a 40 second interval) is stuck with the following stack trace and never completes:

java.net.SocketInputStream.socketRead0(Native Method) java.net.SocketInputStream.socketRead(SocketInputStream.java:116) java.net.SocketInputStream.read(SocketInputStream.java:170) java.net.SocketInputStream.read(SocketInputStream.java:141) sun.security.ssl.InputRecord.readFully(InputRecord.java:465) sun.security.ssl.InputRecord.readV3Record(InputRecord.java:593) sun.security.ssl.InputRecord.read(InputRecord.java:532) sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973) sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375) sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403) sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387) org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:533) org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:401) org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177) org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:144) org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:131) org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:610) org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:445) org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863) org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57) org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:326) org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:277) org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:1038) org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2250) org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectDetailsImpl(RestStorageService.java:2179) org.jets3t.service.StorageService.getObjectDetails(StorageService.java:1120) org.jets3t.service.StorageService.getObjectDetails(StorageService.java:575) org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:174) sun.reflect.GeneratedMethodAccessor32.invoke(Unknown Source) sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) java.lang.reflect.Method.invoke(Method.java:497) org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187) org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102) org.apache.hadoop.fs.s3native.$Proxy18.retrieveMetadata(Unknown Source) org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:472) org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424) org.apache.spark.rdd.ReliableCheckpointRDD$.writePartitionToCheckpointFile(ReliableCheckpointRDD.scala:168) org.apache.spark.rdd.ReliableCheckpointRDD$$anonfun$writeRDDToCheckpointDirectory$1.apply(ReliableCheckpointRDD.scala:136) org.apache.spark.rdd.ReliableCheckpointRDD$$anonfun$writeRDDToCheckpointDirectory$1.apply(ReliableCheckpointRDD.scala:136) org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66) org.apache.spark.scheduler.Task.run(Task.scala:89) org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) java.lang.Thread.run(Thread.java:745)

While being stuck, the spark driver refuses to continue processing incoming batches, and creates a huge backlog of queued batches which can't be processed until releasing the task that is "stuck".

Further more, looking at the driver thread dump under streaming-job-executor-0 clearly shows that it is waiting for this task to complete:

java.lang.Object.wait(Native Method) java.lang.Object.wait(Object.java:502) org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73) org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:612) org.apache.spark.SparkContext.runJob(SparkContext.scala:1832) org.apache.spark.SparkContext.runJob(SparkContext.scala:1845) org.apache.spark.SparkContext.runJob(SparkContext.scala:1922) org.apache.spark.rdd.ReliableCheckpointRDD$.writeRDDToCheckpointDirectory(ReliableCheckpointRDD.scala:135) org.apache.spark.rdd.ReliableRDDCheckpointData.doCheckpoint(ReliableRDDCheckpointData.scala:58) org.apache.spark.rdd.RDDCheckpointData.checkpoint(RDDCheckpointData.scala:74) org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply$mcV$sp(RDD.scala:1682) org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1679) org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1679) org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150) org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:1678) org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1$$anonfun$apply$mcV$sp$1.apply(RDD.scala:1684) org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1$$anonfun$apply$mcV$sp$1.apply(RDD.scala:1684) scala.collection.immutable.List.foreach(List.scala:318)

Has anyone experienced such an issue?

Are you using AWS EMR or a custom managed cluster on EC2 ? It seems like EMR isn't suited for long stressful job as the state of the Hadoop cluster decays with time. media.amazonwebservices.com/AWS_Amazon_EMR_Best_Practices.pdf page 27-28 — eliasah
@eliasah Nope, this isn't EMR. It's a Spark cluster I've manually set up running with standalone cluster manager — Yuval Itzchakov
It is somewhat similar to stackoverflow.com/q/34879092/1560062, isn't it? — zero323
@zero323 I'm not seeing similarities. The problem there was that he was seeing timeouts and exceptions, here the case is a complete hang of the entire job. — Yuval Itzchakov

Yuval Itzchakov Yuval Itzchakov · Accepted Answer · 2016-07-27T09:16:14

The socket hang happens due to a bug in the HttpClient library used by org.jets3t where the SSL handshake doesn't use the specified timeout. You can find the issue details here.

This bug reproduces in HttpClient versions below v4.5.1, where it was fixed. Unfortunately, Spark 1.6.x uses v4.3.2, which doesn't have the supplied fix.

There are three possible workaround I've thought of so far:

Use Spark's speculation mechanism via the spark.speculation configuration settings. This helps with the edge cases of the hang as it reproduces rarely and under load. Note this can cause some false positives in the beginning of the streaming job where spark doesn't have a good impression of how long running your median task is, but it is definitely not something that causes a noticeable lag.

The documentation says:

If set to "true", performs speculative execution of tasks. This means if one or more tasks are running slowly in a stage, they will be re-launched.

You turn it on by supplying the flags to spark-submit:
```
spark-submit  \
--conf "spark.speculation=true" \
--conf "spark.speculation.multiplier=5" \
```
For more on the different settings you can pass see the Spark Configuration page
Manually passing HttpClient v4.5.1 or above to Sparks classpath, so it can load this JAR prior to one it has in it's uber JAR. This can be a little difficult as the class loading process with Spark is a bit cumbersome. This means that you can do something along the lines of:
```
CP=''; for f in /path/to/httpcomponents-client-4.5.2/lib/*.jar; do CP=$CP$f:; done
SPARK_CLASSPATH="$CP" sbin/start-master.sh   # on your master machine
SPARK_CLASSPATH="$CP" sbin/start-slave.sh 'spark://master_name:7077' 
```
Or simply update the specific version of the JAR to SPARK_CLASSPATH in spark-env.sh.
Updating to Spark 2.0.0. The new version of Spark uses HttpClient v4.5.2 which resolves this issue.

Spark stateful streaming job hangs at checkpointing to S3 after long uptime

1 Answers