
I've always understood that the Spark shells, whether PySpark or Scala, run in client mode. And correct me if I'm wrong: there isn't an out-of-the-box configuration to run them in cluster mode.

Why is this the case? What makes cluster mode unsuitable for these interactive shells?

Network latency between the client and the driver may be one factor. And if YARN is used, there may be a higher initial startup time, since cluster resources for the driver need to be provisioned from the YARN ResourceManager. But neither of these seems like a serious blocker to me.

EDIT
The question Spark-submit / spark-shell > difference between yarn-client and yarn-cluster mode is related, but doesn't focus on (and the answers do not cover) why the shells cannot run in cluster mode.

pyspark --deploy-mode cluster
Error: Cluster deploy mode is not applicable to Spark shells.
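For contrast, a non-interactive application submitted with spark-submit accepts either deploy mode. A minimal sketch (my_app.py and the YARN master are assumptions, not part of the original question):

spark-submit --master yarn --deploy-mode client  my_app.py    # driver runs on this machine
spark-submit --master yarn --deploy-mode cluster my_app.py    # driver runs on a node inside the cluster

pyspark     --master yarn --deploy-mode client                # shells accept client mode only
spark-shell --master yarn --deploy-mode client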
@mazaneicha That question is related, but it doesn't focus on (and the answers do not cover) why the shells cannot run in cluster mode. If you happen to know, could you explain this in an answer? – flow2k
Shell needs to run on your (client) machine in order to accept input. I think the answer with the highest score makes it clear. – mazaneicha
@mazaneicha You say "Shell needs to run on your (client) machine in order to accept input." I don't think this addresses the question; it simply points out that the shell accepts input, which is a well-known fact. Could you elaborate further? Some Jupyter PySpark kernels run in cluster mode but are also interactive. – flow2k

1 Answer


Because the Spark shell is used for interactive queries, the Spark driver must be running on your host (not as a container inside the cluster). In other words, the driver is what you use to connect to the cluster: it is the interface through which your interactive program is processed. If it were launched in cluster mode, the driver would run on a node chosen by the cluster manager, with no terminal attached to accept your input.
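To illustrate, here is a minimal sketch of a client-mode shell session (the YARN master is an assumption; any cluster manager behaves the same way). The executors do the work inside the cluster, but the shell's input and output stay with the driver in your terminal:

$ pyspark --master yarn --deploy-mode client
>>> rdd = sc.parallelize(range(100))   # sc is the shell's SparkContext, i.e. the driver on your machine
>>> rdd.sum()                          # tasks run on executors inside the cluster
4950                                   # the result is returned to the driver, i.e. your terminal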