
I've always understood that the Spark shells, whether PySpark or Scala, run in client mode. And correct me if I'm wrong: there isn't an out-of-the-box configuration to run them in cluster mode.

Why is this the case? What makes cluster mode unsuitable for these interactive shells?

Network latency between the client and the driver may be one factor. And if YARN is used, there may be a higher initial startup time, since cluster resources for the driver need to be provisioned from the YARN ResourceManager. But neither of these seems like a serious blocker to me.

EDIT
The question Spark-submit / spark-shell > difference between yarn-client and yarn-cluster mode is related, but doesn't focus on (and the answers do not cover) why the shells cannot run in cluster mode.

pyspark --deploy-mode cluster
Error: Cluster deploy mode is not applicable to Spark shells.
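For contrast, a non-interactive application submitted with spark-submit accepts either deploy mode. A minimal sketch (my_app.py and the YARN master are assumptions, not part of the original question):

spark-submit --master yarn --deploy-mode client  my_app.py    # driver runs on this machine
spark-submit --master yarn --deploy-mode cluster my_app.py    # driver runs on a node inside the cluster

pyspark     --master yarn --deploy-mode client                # shells accept client mode only
spark-shell --master yarn --deploy-mode client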
@mazaneicha That question is related, but it doesn't focus on (and the answers do not cover) why the shells cannot run in cluster mode. If you happen to know, could you explain this in an answer? – flow2k
Shell needs to run on your (client) machine in order to accept input. I think the answer with the highest score makes it clear. – mazaneicha
@mazaneicha You say "Shell needs to run on your (client) machine in order to accept input." I don't think this addresses the question; it simply points out that the shell accepts input, which is a well-known fact. Could you elaborate further? Some Jupyter PySpark kernels run in cluster mode but are also interactive. – flow2k

1 Answer


Because the Spark shell is used for interactive queries, the Spark driver must be running on your host (not as a container inside the cluster). In other words, the driver is what you use to connect to the cluster: it is the interface through which your interactive program is processed. If it were launched in cluster mode, the driver would run on a node chosen by the cluster manager, with no terminal attached to accept your input.
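To illustrate, here is a minimal sketch of a client-mode shell session (the YARN master is an assumption; any cluster manager behaves the same way). The executors do the work inside the cluster, but the shell's input and output stay with the driver in your terminal:

$ pyspark --master yarn --deploy-mode client
>>> rdd = sc.parallelize(range(100))   # sc is the shell's SparkContext, i.e. the driver on your machine
>>> rdd.sum()                          # tasks run on executors inside the cluster
4950                                   # the result is returned to the driver, i.e. your terminal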