
I've set up a Hadoop cluster on Amazon EC2 with NameNode/DataNode and some other services. My ingestion job brings the data into the EC2 HDFS cluster (let's say hdfs://ec2-hdfs/).

Now I have a pipeline that runs as a weekly batch. I launch a new Amazon EMR cluster to run the computation, and once the processing completes, I terminate the EMR cluster.

The input for my Spark job that needs to run on EMR is in the EC2 HDFS (hdfs://ec2-hdfs/). How can I access it from the newly created EMR cluster? I believe there should be some option (bootstrap/VPC/subnet) available during the EMR cluster launch.


1 Answer


You would have to bootstrap fs.defaultFS in core-site.xml to point at the NameNode of the persistent EC2 cluster whenever the EMR cluster starts, or you could explicitly use the full URI (hdfs://namenode:port/ec2-hdfs) within your code.
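
For the launch-time approach, a minimal sketch using EMR's configuration classifications; the NameNode host and port 8020 below are placeholders, so check the persistent cluster's core-site.xml for the actual RPC address:

    [
      {
        "Classification": "core-site",
        "Properties": {
          "fs.defaultFS": "hdfs://ec2-namenode-host:8020"
        }
      }
    ]

A file like this can be passed via the --configurations option of aws emr create-cluster (or the equivalent field in the console/API) so every node comes up pointing at the remote NameNode.

Alternatively, leave the EMR cluster's defaults untouched and reference the remote HDFS with fully qualified URIs inside the Spark job itself. A rough PySpark sketch, again with placeholder host, port, and paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("weekly-batch").getOrCreate()

    # Fully qualified URI, so the read goes to the EC2 HDFS rather than
    # the EMR cluster's own fs.defaultFS
    df = spark.read.parquet("hdfs://ec2-namenode-host:8020/data/input/")

    # ... transformations ...

    # Write results back to the persistent cluster (or to S3 / EMR-local HDFS)
    df.write.mode("overwrite").parquet("hdfs://ec2-namenode-host:8020/data/output/")

In either case the EMR cluster needs network reachability to the EC2 NameNode and DataNodes (same VPC or peered VPCs, with security groups allowing the HDFS ports), which is what the VPC/subnet choice at launch controls.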