2 votes

I'm trying to minimize changes in my code, so I'm wondering if there is a way to submit a Spark Streaming job from my personal PC/VM as follows:

spark-submit --class path.to.your.Class --master yarn --deploy-mode client \
    [options] <app jar> [app options]

without using the GCP SDK.

I also have to specify a directory of configuration files (HADOOP_CONF_DIR), which I was able to download from Ambari. Is there a way to do the same with Dataproc?

Thank you

There isn't a way to do this that doesn't involve the Cloud SDK. Either dataproc jobs submit or compute ssh -c could be used. Why do you not want to use the Cloud SDK? – tix
@tix Previously I was using Spark in standalone mode, and on every batchFinish I was executing an external script. So I wanted to run the Spark driver locally, in the client process (--deploy-mode client; I will fix my example), to be able to run that external script. – Alex
To be able to use local tooling, you'd need to open your VM ports, which, unless you have a bridged VPC, would not be advisable. If the issue is running the script, you could download it to the master VM via an initialization action, or package it in your jar as a resource and extract it when the program starts. – tix
This script runs in the context of my main service, so I cannot extract it onto the Dataproc master. If I create a VM in GCP in the same project as the Dataproc cluster and configure the networking, do you think it would be possible to run the Spark driver on that VM? I'm just not sure where I can get the Hadoop configuration files. – Alex

1 Answer

1 vote

Setting up an external machine as a YARN client node is generally difficult to do and not a workflow that will work easily with Dataproc.

In a comment you mention that what you really want to do is

  1. Submit a Spark job to the Dataproc cluster.
  2. Run a local script on each "batchFinish" (StreamingListener.onBatchCompleted?); see the sketch after this list.
    • The script has dependencies that mean it cannot run inside of the Dataproc master node.
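
For reference, here is a minimal sketch of what I understand that listener to do today; the class name, script path, and the commented-out registration are placeholders rather than anything from your setup:

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}
    import scala.sys.process._

    // Runs an external script every time a batch finishes. This executes on the
    // driver, so the script has to exist wherever the driver process is running.
    class BatchFinishListener(scriptPath: String) extends StreamingListener {
      override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
        val batchTime = batchCompleted.batchInfo.batchTime.milliseconds
        Seq(scriptPath, batchTime.toString).!
      }
    }

    // Registered on the StreamingContext before it is started:
    // ssc.addStreamingListener(new BatchFinishListener("/path/to/your/script.sh"))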

Again, configuring a client node outside of the Dataproc cluster and getting it to work with spark-submit is not going to work directly. However, if you can configure your network such that the Spark driver (running within Dataproc) has access to the service/script you need to run, you can invoke it from the driver whenever you need to.

If you run your service on a VM that has access to the network of the Dataproc cluster, then your Spark driver should be able to access the service.
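
As a sketch of that approach, assuming your service exposes some HTTP endpoint (the URL and endpoint path below are placeholders):

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}
    import scala.io.Source

    // Instead of shelling out, notify the external service over the network on each batch.
    // The service VM only needs to be reachable from the Dataproc cluster's network.
    class BatchFinishNotifier(serviceUrl: String) extends StreamingListener {
      override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
        val batchTime = batchCompleted.batchInfo.batchTime.milliseconds
        val response = Source.fromURL(s"$serviceUrl/batch-finished?time=$batchTime")
        try response.mkString finally response.close()
      }
    }

    // ssc.addStreamingListener(new BatchFinishNotifier("http://your-service-vm:8080"))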