I've set up a Hadoop cluster on Amazon EC2 with NameNode/DataNode and some other services. My ingestion job brings the data into the EC2 HDFS cluster (let's say hdfs://ec2-hdfs/).
Now I have a pipeline that runs as a weekly batch: I launch a new Amazon EMR cluster to run my computation, and once the processing completes, I terminate the EMR cluster.
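For context, I launch the weekly cluster roughly like this (a minimal boto3 sketch; the name, release label, instance types, roles, and subnet ID are all placeholders):

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a transient cluster for the weekly batch; it shuts
# down once there are no more steps to run.
response = emr.run_job_flow(
    Name="weekly-spark-batch",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # I assume this is where the VPC/subnet choice comes in:
        # presumably the subnet of the existing EC2 HDFS cluster?
        "Ec2SubnetId": "subnet-xxxxxxxx",
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```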
The input for my Spark job that needs to run on EMR is in the EC2 HDFS cluster (hdfs://ec2-hdfs/). How can I access it from the newly created EMR cluster? I believe there should be some option (bootstrap action/VPC/subnet) available during the EMR cluster launch.
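Concretely, inside the Spark job on EMR I'd want something like this to work (a minimal PySpark sketch; the subpath and Parquet format are just illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("weekly-batch").getOrCreate()

# Read input from the long-running EC2 HDFS cluster, not from
# the EMR cluster's own HDFS (path below is illustrative).
df = spark.read.parquet("hdfs://ec2-hdfs/data/weekly/input")

df.show()  # ...rest of the computation
```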