I am new to Airflow automation. I don't know whether it is possible to do this with Apache Airflow (or Luigi etc.), or whether I should just write one long bash file.
I want to build a DAG that does the following:
- Create/clone a cluster on AWS EMR
- Install python requirements
- Install pyspark related libraries
- Get latest code from github
- Submit spark job
- Terminate cluster on finish
For the individual steps, I can make .sh files like the ones below (not sure whether this is a good idea), but I don't know how to do it in Airflow.
1) Creating a cluster with cluster.sh
aws emr create-cluster \
--name "1-node dummy cluster" \
--instance-type m3.xlarge \
--release-label emr-4.1.0 \
--instance-count 1 \
--use-default-roles \
--applications Name=Spark \
--auto-terminate
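For comparison, the same cluster settings can be written as a Python dict in the shape that boto3's `run_job_flow` accepts (Airflow's `EmrCreateJobFlowOperator` takes the same shape as `job_flow_overrides`). This is only a sketch: the helper name `build_job_flow_overrides` and its `keep_alive` parameter are made up for illustration, and the role names are assumed to be what `--use-default-roles` resolves to.

```python
def build_job_flow_overrides(keep_alive: bool = True) -> dict:
    """Return a RunJobFlow-style config mirroring the create-cluster flags above."""
    return {
        "Name": "1-node dummy cluster",            # --name
        "ReleaseLabel": "emr-4.1.0",               # --release-label
        "Applications": [{"Name": "Spark"}],       # --applications Name=Spark
        "Instances": {
            "MasterInstanceType": "m3.xlarge",     # --instance-type
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": 1,                    # --instance-count
            # False corresponds to --auto-terminate: the cluster shuts down
            # as soon as it has no steps left to run.
            "KeepJobFlowAliveWhenNoSteps": keep_alive,
        },
        # Assumed equivalents of --use-default-roles.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


overrides = build_job_flow_overrides()
```

The default here keeps the cluster alive, because a cluster created with `--auto-terminate` and no steps would likely shut down before a later task could submit the Spark job. In that setup the cluster has to be terminated explicitly at the end.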
2 & 3 & 4) Cloning the git repo and installing requirements with codesetup.sh
git clone some-repo.git
pip install -r requirements.txt
mv xyz.jar /usr/lib/spark/xyz.jar
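These commands need to run on the cluster nodes rather than on the machine that launches the cluster. One possible way to do that is an EMR bootstrap action, which runs a script stored in S3 on every node at startup. The sketch below assumes codesetup.sh has been uploaded to S3; the helper name and the S3 URI are hypothetical.

```python
def build_bootstrap_actions(script_s3_uri: str) -> list:
    """Return a BootstrapActions list that runs one setup script on every node."""
    return [
        {
            "Name": "code setup",
            "ScriptBootstrapAction": {
                "Path": script_s3_uri,  # e.g. an uploaded copy of codesetup.sh
                "Args": [],
            },
        }
    ]


# Hypothetical location; replace with wherever the script is actually stored.
bootstrap_actions = build_bootstrap_actions("s3://your-source-bucket/scripts/codesetup.sh")
```

The resulting list would go under the `BootstrapActions` key of the cluster config. Bootstrap actions run before EMR installs applications such as Spark, so the `mv` into /usr/lib/spark may need to happen elsewhere, for example in a cluster step.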
5) Running the Spark job with sparkjob.sh
aws emr add-steps --cluster-id <Your EMR cluster id> --steps Type=spark,Name=TestJob,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=true,pythonjob.py,s3a://your-source-bucket/data/data.csv,s3a://your-destination-bucket/test-output/],ActionOnFailure=CONTINUE
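The same step can be expressed as a Python structure in the form the EMR API expects for `Steps` (usable with boto3's `add_job_flow_steps` or Airflow's `EmrAddStepsOperator`). This sketch assumes that the CLI's Spark step type expands to a `command-runner.jar` step that invokes `spark-submit`; the helper name `build_spark_steps` is hypothetical.

```python
def build_spark_steps(script: str, source: str, destination: str) -> list:
    """Return a one-element Steps list equivalent to the add-steps call above."""
    return [
        {
            "Name": "TestJob",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                # command-runner.jar runs the given command on the master node.
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "--deploy-mode", "cluster",
                    "--master", "yarn",
                    "--conf", "spark.yarn.submit.waitAppCompletion=true",
                    script,
                    source,
                    destination,
                ],
            },
        }
    ]


spark_steps = build_spark_steps(
    "pythonjob.py",
    "s3a://your-source-bucket/data/data.csv",
    "s3a://your-destination-bucket/test-output/",
)
```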
6) Not sure, maybe this:
aws emr terminate-clusters \
--cluster-ids <value> [<value>...]
Finally, all of this can be executed as one .sh file. I need to know a good approach to this with Airflow/Luigi.
What I found:
I found this post to be close, but it is outdated (2016) and is missing the connections and the code for the playbooks:
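One possible shape for such a DAG is sketched below. It assumes Airflow 2.4 or later with the `apache-airflow-providers-amazon` package installed and the default `aws_default` connection configured. The DAG id, task ids, and the `build_dag` helper are made-up names, and the cluster config and step list are passed in as plain dicts in the EMR API format.

```python
from datetime import datetime

# Jinja templates, rendered at run time from the XCom values that the
# create and add-steps tasks return (cluster id and list of step ids).
CLUSTER_ID = "{{ task_instance.xcom_pull(task_ids='create_cluster', key='return_value') }}"
FIRST_STEP_ID = "{{ task_instance.xcom_pull(task_ids='add_spark_step', key='return_value')[0] }}"


def build_dag(job_flow_overrides: dict, spark_steps: list):
    """Build a create -> submit -> wait -> terminate DAG for one EMR cluster."""
    # Imported here so this module still loads where Airflow is not installed.
    from airflow import DAG
    from airflow.providers.amazon.aws.operators.emr import (
        EmrAddStepsOperator,
        EmrCreateJobFlowOperator,
        EmrTerminateJobFlowOperator,
    )
    from airflow.providers.amazon.aws.sensors.emr import EmrStepSensor

    with DAG(
        dag_id="emr_spark_job",
        start_date=datetime(2024, 1, 1),
        schedule=None,  # run only when triggered manually
        catchup=False,
    ) as dag:
        create_cluster = EmrCreateJobFlowOperator(
            task_id="create_cluster",
            job_flow_overrides=job_flow_overrides,
        )
        add_spark_step = EmrAddStepsOperator(
            task_id="add_spark_step",
            job_flow_id=CLUSTER_ID,
            steps=spark_steps,
        )
        wait_for_step = EmrStepSensor(
            task_id="wait_for_step",
            job_flow_id=CLUSTER_ID,
            step_id=FIRST_STEP_ID,
        )
        terminate_cluster = EmrTerminateJobFlowOperator(
            task_id="terminate_cluster",
            job_flow_id=CLUSTER_ID,
            trigger_rule="all_done",  # attempt cleanup even if the step failed
        )
        create_cluster >> add_spark_step >> wait_for_step >> terminate_cluster
    return dag
```

A file in the dags folder would end with a module-level call such as `dag = build_dag(overrides, steps)` so that the scheduler can discover it. Installing requirements and fetching code would happen on the cluster itself (for example through bootstrap actions in the cluster config), not in separate Airflow tasks.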
https://www.agari.com/email-security-blog/automated-model-building-emr-spark-airflow/