3 votes

We are running the following stage DAG and experiencing long shuffle read times for relatively small shuffle data sizes (about 19 MB per task).

[screenshot: stage DAG from the Spark UI]

One interesting aspect is that waiting tasks within each executor/server have equivalent shuffle read times. Here is an example of what this means: on the following server, one group of tasks waits about 7.7 minutes and another group waits about 26 seconds.

[screenshot: task list for one server, showing one group of tasks with ~7.7 min shuffle read time and another with ~26 s]

Here is another example from the same stage run. The figure shows 3 executors/servers, each with a uniform group of tasks having equal shuffle read time. The blue group represents tasks killed due to speculative execution:

[screenshot: event timeline for 3 executors, each with a uniform group of shuffle read times; blue marks speculatively killed tasks]

Not all executors behave like this. Some finish all their tasks within seconds, fairly uniformly, and the amount of remotely read data for those tasks is the same as for the ones that wait a long time on other servers. Also, this type of stage runs twice within our application, and the servers/executors that produce these groups of tasks with large shuffle read time are different in each run.

Here is an example of the task stats table for one of the servers/hosts:

[screenshot: task stats table for one host]

It looks like the code responsible for this DAG is the following:

// comparison is defined (and cached) first, then reused by the writes below
val comparison = data.union(output).except(data.intersect(output)).cache()
comparison.filter(_.abc != "M").count()
output.write.parquet("output.parquet")
comparison.write.parquet("comparison.parquet")
output.union(comparison).write.parquet("output_comparison.parquet")
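
For reference, the physical plan of the suspected step can be printed to see whether except/intersect are the source of the SortMergeJoin in the DAG. This is just a sketch; it assumes data and output are Datasets with a common schema, as in the code above:

// Prints the plan; the "== Physical Plan ==" section shows whether except/intersect
// are planned as SortMergeJoin plus Exchange (shuffle) operators.
data.union(output).except(data.intersect(output)).explain(true)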

We would highly appreciate your thoughts on this.

Strange. Code and data samples would be appreciated. I see every step of that DAG has a cache call; are you caching everything? – Garren S
Hello. Thank you for your question. I posted the code in the description above. We are caching only when we think it is needed. – Dimon
The except and intersect calls are on my radar as concerns. Your DAG references a SortMergeJoin; do you already know which line(s) are causing the trouble? – Garren S
We think that the SortMergeJoin comes from except or intersect in the above code. Another piece of information: we are using MesosExternalShuffleService. – Dimon

2 Answers

1 vote

Apparently the problem was JVM garbage collection (GC). The tasks had to wait until GC was done on the remote executors. The equivalent shuffle read times resulted from the fact that several tasks were waiting on a single remote host that was performing GC. We followed the advice posted here and the problem decreased by an order of magnitude. There is still a small correlation between GC time on remote hosts and local shuffle read time. In the future we plan to try the shuffle service.
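
For anyone who lands here with the same problem, below is a minimal sketch of the kind of executor GC settings described in the Spark tuning guide. The exact options are illustrative assumptions, not the configuration we ended up with, and they can equally be passed via spark-submit --conf or spark-defaults.conf:

import org.apache.spark.sql.SparkSession

// Assumption: switch executors to G1 and enable GC logging so that GC pauses
// on remote hosts can be correlated with the long shuffle read times in the UI.
val spark = SparkSession.builder()
  .config("spark.executor.extraJavaOptions",
    "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseG1GC")
  .getOrCreate()

The GC log lines end up in each executor's stdout log, which makes checking the correlation with shuffle read time straightforward.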

0 votes

Since Google brought me here with the same problem, but I needed another solution...

Another possible reason for a small shuffle taking a long time to read is that the data is split over many partitions. For example (apologies, this is PySpark, as it is all I have used):

(my_df_with_many_partitions   # say it has 1000 partitions
    .filter(very_specific_filter)   # only very few rows pass
    .groupby('blah')
    .count())

The shuffle write from the filter above will be very small, so the stage after it has very little to read. But to read it, you still have to check a lot of mostly empty partitions. One way to address this would be:

(my_df_with_many_partitions
    .filter(very_specific_filter)
    .repartition(1)   # collapse the filtered data into a single partition
    .groupby('blah')
    .count())