At the moment, there isn't a specialized top-level Dataproc Job type for uncompiled Spark Scala, but under the hood, spark-shell is just using the same mechanisms as spark-submit to run a specialized REPL driver: org.apache.spark.repl.Main. Thus, combining this with the --files flag available in gcloud dataproc jobs submit spark, you can simply write snippets of Scala that you may have tested in a spark-shell or notebook session and run them as your entire Dataproc job, assuming job.scala is a local file on your machine:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
    --class org.apache.spark.repl.Main \
    --files job.scala \
    -- -i job.scala
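For illustration, job.scala could contain something like the following; this is just a hypothetical snippet, relying on the sc and spark variables that the REPL driver predefines, and ending with sys.exit(0) so the REPL terminates once the script completes rather than relying on how end-of-input is handled:

// job.scala -- a hypothetical snippet; sc and spark are predefined by the REPL driver
val doubled = sc.parallelize(1 to 1000).map(_ * 2)
println(s"Sum of doubled values: ${doubled.reduce(_ + _)}")
// Exit explicitly so the driver terminates once the script finishes
sys.exit(0)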
Just like any other file, you can also specify any Hadoop-compatible path in the --files argument, such as gs:// or even hdfs://, assuming you've already placed your job.scala file there:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
    --class org.apache.spark.repl.Main \
    --files gs://${BUCKET}/job.scala \
    -- -i job.scala

gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
    --class org.apache.spark.repl.Main \
    --files hdfs:///tmp/job.scala \
    -- -i job.scala
If you've staged your job file onto the Dataproc master node via an init action, you'd use file:/// to indicate that the file is on the cluster's local filesystem rather than on the local filesystem of the machine where you're running gcloud:
gcloud dataproc jobs submit spark --cluster ${CLUSTER} \
    --class org.apache.spark.repl.Main \
    --files file:///tmp/job.scala \
    -- -i job.scala
Note that in all cases, the file becomes a local file in the working directory of the main driver job, so the argument to -i can simply be the bare filename as a relative path.
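The same applies to any other files you distribute alongside the script: if, hypothetically, you also listed a lookup.txt in --files next to job.scala, the snippet could read it with a plain relative path from the driver, for example:

// Hypothetical: read a lookup.txt that was also passed via --files, using a relative
// path, since those files land in the driver job's working directory as well
val lookup = scala.io.Source.fromFile("lookup.txt").getLines().toSet
val matched = sc.parallelize(Seq("a", "b", "c")).filter(lookup.contains).collect()
println(matched.mkString(", "))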