I currently run my Apache Spark (PySpark) scripts on clusters of EC2 instances launched with Spark's preconfigured ./ec2 scripts. For automation and scheduling purposes, I would like to use the boto EMR module to send scripts up to the cluster.
I was able to bootstrap and install Spark on an EMR cluster. I am also able to launch a script on EMR using my local machine's copy of pyspark, setting the master like so:
$: MASTER=spark://<insert EMR master node of cluster here> ./bin/pyspark <myscriptname.py>
However, this requires me to run the script locally, so I am not able to fully leverage boto's ability to 1) start the cluster, 2) add the script as a step, and 3) stop the cluster. I've found examples that use script-runner.sh and EMR "step" commands for spark-shell (Scala), but I assume there is an easier way to do this with the Python module (pyspark).
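To make the question concrete, here is a minimal sketch of the workflow I am after, written against boto's EMR module. The S3 paths, key pair name, instance types, and the submit-pyspark.sh wrapper script are all placeholders I made up, and step 2 still goes through script-runner.sh, which is exactly the indirection I am hoping to avoid:

import time

import boto.emr
from boto.emr.bootstrap_action import BootstrapAction
from boto.emr.step import ScriptRunnerStep

conn = boto.emr.connect_to_region('us-east-1')

# 1) Start the cluster, bootstrapping Spark the same way I do today.
install_spark = BootstrapAction(
    'Install Spark',
    's3://mybucket/bootstrap/install-spark.sh',  # placeholder path
    [])

# 2) Add the PySpark script as a step. script-runner.sh invokes a shell
#    wrapper that calls pyspark -- this is the part I'd like to replace
#    with something pyspark-aware, if such a step type exists.
run_script = ScriptRunnerStep(
    'Run myscriptname.py',
    step_args=[
        's3://mybucket/scripts/submit-pyspark.sh',   # placeholder wrapper
        's3://mybucket/scripts/myscriptname.py'],
    action_on_failure='TERMINATE_JOB_FLOW')

jobflow_id = conn.run_jobflow(
    name='pyspark-job',
    log_uri='s3://mybucket/logs/',
    ec2_keyname='my-keypair',          # placeholder key pair
    master_instance_type='m3.xlarge',  # placeholder instance types
    slave_instance_type='m3.xlarge',
    num_instances=3,
    bootstrap_actions=[install_spark],
    steps=[run_script],
    keep_alive=False)                  # 3) terminate when the steps finish

# Poll until the job flow reaches a terminal state.
while conn.describe_jobflow(jobflow_id).state not in (
        'COMPLETED', 'FAILED', 'TERMINATED'):
    time.sleep(30)

Thanks so much in advance!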