4 votes

I want to use Airflow to orchestrate jobs that include running some Pig scripts, shell scripts, and Spark jobs.

For the Spark jobs in particular, I want to use Apache Livy, but I'm not sure whether that is a good idea or whether I should just run spark-submit.

Also, what is the best way to track a Spark job from Airflow once it has been submitted?


1 Answer

3 votes

My assumption is that you have an application JAR containing Java / Scala code that you want to submit to a remote Spark cluster. Livy is arguably the best option for a remote spark-submit when evaluated against the other possibilities (a minimal submission sketch follows the list below):

  • Specifying the remote master IP: requires modifying global configurations / environment variables
  • Using SSHOperator: the SSH connection might break mid-job
  • Using EmrAddStepsOperator: ties you to EMR
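
As a rough sketch of the submission side, assuming a Livy server reachable at livy-server:8998 and placeholder JAR path, class name, and arguments (all of which you'd replace with your own), the callable you wrap in an Airflow PythonOperator could look like this:

```python
import requests

LIVY_URL = "http://livy-server:8998"  # assumed Livy endpoint

def submit_spark_batch():
    """Submit an application JAR to Livy's batch API and return the batch id."""
    payload = {
        "file": "hdfs:///jobs/my-spark-app.jar",  # placeholder path to your JAR
        "className": "com.example.MySparkJob",    # placeholder main class
        "args": ["--input", "/data/input"],       # placeholder arguments
    }
    resp = requests.post(f"{LIVY_URL}/batches", json=payload)
    resp.raise_for_status()
    batch_id = resp.json()["id"]
    print(f"Submitted Livy batch {batch_id}")
    # When wrapped in a PythonOperator, the returned id is pushed to XCom,
    # so a downstream task can pick it up and poll the batch state.
    return batch_id
```

The batch id returned by Livy is what you later poll to track the job.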

Regarding tracking

  • Livy only reports the state of a job, not its progress (% completion of stages)
  • If you're OK with that, you can poll the Livy server via its REST API and keep printing logs to the console; those will appear in the task logs in the Airflow web UI (View Logs). A polling sketch follows this list.
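
A minimal sketch of that polling loop, assuming the same Livy endpoint as above and an arbitrary polling interval:

```python
import time
import requests

LIVY_URL = "http://livy-server:8998"  # assumed Livy endpoint

def wait_for_batch(batch_id, poll_interval=30):
    """Poll Livy until the batch reaches a terminal state, printing progress."""
    terminal_states = {"success", "dead", "killed"}
    while True:
        state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state").json()["state"]
        print(f"Livy batch {batch_id} state: {state}")  # appears in the Airflow task log
        if state in terminal_states:
            # Dump the driver log once the job finishes, also into the task log
            log = requests.get(f"{LIVY_URL}/batches/{batch_id}/log").json().get("log", [])
            for line in log:
                print(line)
            if state != "success":
                raise RuntimeError(f"Spark job failed with state: {state}")
            return state
        time.sleep(poll_interval)
```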

Other considerations

  • Livy doesn't support reusing a SparkSession across POST /batches requests
  • If that is imperative, you'll have to write your application code in PySpark and use POST /sessions requests instead (a session-based sketch follows this list)
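
For illustration, here is a minimal sketch of the session-based flow: one Livy session keeps a single SparkSession alive, and each statement you POST runs inside it. The Livy URL and the PySpark snippets are placeholders:

```python
import time
import requests

LIVY_URL = "http://livy-server:8998"  # assumed Livy endpoint

def run_statement(session_id, code):
    """Run a PySpark snippet in an existing Livy session and return its output."""
    resp = requests.post(f"{LIVY_URL}/sessions/{session_id}/statements", json={"code": code})
    stmt_id = resp.json()["id"]
    while True:
        stmt = requests.get(f"{LIVY_URL}/sessions/{session_id}/statements/{stmt_id}").json()
        if stmt["state"] == "available":
            return stmt["output"]
        time.sleep(5)

# Create one PySpark session; every statement below shares its SparkSession
session_id = requests.post(f"{LIVY_URL}/sessions", json={"kind": "pyspark"}).json()["id"]

# Wait for the session to become idle before submitting statements
while requests.get(f"{LIVY_URL}/sessions/{session_id}").json()["state"] != "idle":
    time.sleep(5)

# Placeholder snippets: `spark` is the session's shared SparkSession
run_statement(session_id, "df = spark.read.parquet('/data/input')")
print(run_statement(session_id, "df.count()"))
```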
