I am considering using Spark on AWS EMR to run a Spark application against very large Parquet files stored in S3. The overall flow is that a Java process would upload these large files to S3, and I'd like to automatically trigger a Spark job on those files (injected with the S3 key name(s) of the uploaded files).
Ideally, there would be some kind of S3-based EMR trigger available to wire up; that is, I'd configure EMR/Spark to "listen" to an S3 bucket and kick off a Spark job when an upsert is made to that bucket.
If no such trigger exists, I could probably kludge something together, such as kicking off a Lambda from the S3 event and having the Lambda somehow trigger the EMR Spark job.
However, my understanding (please correct me if I'm wrong) is that the only way to kick off a Spark job is to:
- Package the job up as an executable JAR file; and
- Submit it to the cluster (EMR or otherwise) via the `spark-submit` shell script
So if I have to do the Lambda-based kludge, I'm not exactly sure what the best way to trigger the EMR/Spark job is, seeing that Lambdas don't natively carry `spark-submit` in their runtimes. And even if I configured my own Lambda runtime (which I believe is now possible), this solution already feels wonky and brittle.
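For reference, the kludge I have in mind would look roughly like the sketch below. It assumes a long-running EMR cluster whose ID I already know, a job JAR already staged in S3, and a made-up main class; none of those details are settled. The idea is that the Lambda never needs `spark-submit` itself: it just calls the EMR `AddJobFlowSteps` API, and `command-runner.jar` runs `spark-submit` on the cluster with the uploaded object's S3 path passed in as a job argument.

```python
# Rough sketch only: CLUSTER_ID, JOB_JAR, and com.example.MyJob are placeholders.
# Assumes the Lambda is subscribed to the bucket's s3:ObjectCreated:* events.
import urllib.parse

import boto3

emr = boto3.client("emr")

CLUSTER_ID = "j-XXXXXXXXXXXXX"                     # hypothetical: an already-running EMR cluster
JOB_JAR = "s3://my-bucket/jars/my-spark-job.jar"   # hypothetical: the packaged Spark job


def handler(event, context):
    """Triggered by an S3 ObjectCreated event; submits one Spark step per new object."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        input_path = f"s3://{bucket}/{key}"

        emr.add_job_flow_steps(
            JobFlowId=CLUSTER_ID,
            Steps=[
                {
                    "Name": f"process {key}",
                    "ActionOnFailure": "CONTINUE",
                    "HadoopJarStep": {
                        # command-runner.jar lets an EMR step invoke spark-submit on the cluster
                        "Jar": "command-runner.jar",
                        "Args": [
                            "spark-submit",
                            "--deploy-mode", "cluster",
                            "--class", "com.example.MyJob",  # hypothetical main class
                            JOB_JAR,
                            input_path,  # the uploaded file's S3 path, injected as a job argument
                        ],
                    },
                }
            ],
        )
```

Even if that's roughly the right shape, it still feels like a workaround rather than a supported trigger, which is why I'm asking: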
Has anybody ever triggered an EMR/Spark job from an S3 trigger (or any other AWS trigger) before?