1 vote

I would like to know whether there is a way to chain jobs in Spark, so that the output RDD (or other format) of the first job is passed as input to the second job.

Is there any API for this in Apache Spark? Is this even an idiomatic approach?

From what I have found, there is a way to spin up another process through the YARN client (for example, Spark - Call Spark jar from java with arguments), but this assumes that you save the output to some intermediate storage between jobs.

There are also runJob and submitJob on SparkContext, but are they a good fit for this?

1
You will have to provide a handle when chaining your jobs that passes the computed RDD from one job to the next. To avoid re-execution of the DAG behind the final computed RDD, persist or cache it in memory, let the second job's transformations work on that RDD, then unpersist it and repeat the same steps for the RDD produced by the second job (see the sketch after these comments). - Aviral Kumar
A Spark job != a Spark application. You can have multiple jobs in an application, but you cannot, in general, share RDDs between applications. - Alper t. Turker
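
A minimal sketch of the persist/unpersist pattern the first comment describes, all inside a single Spark application. The file paths, the SparkSession setup, and the transformations are placeholders, not part of the original question:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.storage.StorageLevel

    object ChainedJobs {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("chained-jobs").getOrCreate()
        val sc = spark.sparkContext

        // "Job 1": compute an RDD and cache it so its DAG is not recomputed
        // by later jobs in the same application.
        val first = sc.textFile("hdfs:///input/data.txt")   // placeholder path
          .map(_.split(","))
          .persist(StorageLevel.MEMORY_AND_DISK)
        first.count()                                        // action that triggers job 1

        // "Job 2": its transformations start from the cached RDD instead of
        // re-reading and re-processing the source.
        val second = first.map(fields => (fields(0), fields.length))
        second.saveAsTextFile("hdfs:///output/job2")         // placeholder path

        first.unpersist()                                    // release the cache once job 2 is done
        spark.stop()
      }
    }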

1 Answer

0 votes

Use the same RDD definition to define the input/output of your jobs. You should then be able to chain them.
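
One way to read this, sketched below with made-up job names: treat each job as a plain function whose output RDD type matches the next job's input RDD type, so the jobs compose directly inside one application without intermediate storage:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.sql.SparkSession

    object ChainByRddType {
      // Each "job" is a function over an RDD; jobOne's output type is jobTwo's
      // input type, so the two chain directly.
      def jobOne(lines: RDD[String]): RDD[(String, Int)] =
        lines.flatMap(_.split("\\s+")).map(word => (word, 1)).reduceByKey(_ + _)

      def jobTwo(counts: RDD[(String, Int)]): RDD[String] =
        counts.filter(_._2 > 1).map { case (word, n) => s"$word\t$n" }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("chain-by-rdd-type").getOrCreate()
        val sc = spark.sparkContext

        val input = sc.textFile(args(0))     // input path passed at submit time
        val result = jobTwo(jobOne(input))   // job one's output feeds job two
        result.saveAsTextFile(args(1))

        spark.stop()
      }
    }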

The other option is to use DataFrames instead of RDDs and figure out the schema at run-time.
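
For the DataFrame variant, the schema travels with the data, so a downstream stage can inspect it at run-time instead of hard-coding types. The column names and stage functions below are illustrative only:

    import org.apache.spark.sql.{DataFrame, SparkSession}

    object ChainWithDataFrames {
      // Stage one infers the schema from the source at run-time.
      def stageOne(spark: SparkSession, path: String): DataFrame =
        spark.read.option("header", "true").option("inferSchema", "true").csv(path)

      // Stage two inspects the schema it received before deciding what to do.
      def stageTwo(df: DataFrame): DataFrame =
        if (df.columns.contains("category") && df.columns.contains("amount"))
          df.groupBy("category").sum("amount")
        else
          df

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("chain-with-dataframes").getOrCreate()
        val out = stageTwo(stageOne(spark, args(0)))
        out.write.parquet(args(1))
        spark.stop()
      }
    }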