
I've set up a Hadoop cluster on Amazon EC2 with NameNode/DataNode and some other services. My ingestion job brings the data into the EC2 HDFS cluster (let's say hdfs://ec2-hdfs/).

Now I have a pipeline that runs as a weekly batch. I launch a new Amazon EMR cluster to run the computation, and once the processing completes, I terminate the EMR cluster.

The input for my Spark job that needs to run on EMR is in the EC2 HDFS (hdfs://ec2-hdfs/). How can I access it from the newly created EMR cluster? I believe there should be some option (bootstrap/VPC/subnet) available during the EMR cluster launch.


1 Answer


You would have to bootstrap fs.defaultFS in core-site.xml to point at the NameNode of the persistent EC2 cluster whenever the EMR cluster starts, or you could explicitly use the full URI (hdfs://namenode:port/ec2-hdfs) within your code.
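
For the launch-time approach, a minimal sketch using EMR's configuration classifications; the NameNode host and port 8020 below are placeholders, so check the persistent cluster's core-site.xml for the actual RPC address:

    [
      {
        "Classification": "core-site",
        "Properties": {
          "fs.defaultFS": "hdfs://ec2-namenode-host:8020"
        }
      }
    ]

A file like this can be passed via the --configurations option of aws emr create-cluster (or the equivalent field in the console/API) so every node comes up pointing at the remote NameNode.

Alternatively, leave the EMR cluster's defaults untouched and reference the remote HDFS with fully qualified URIs inside the Spark job itself. A rough PySpark sketch, again with placeholder host, port, and paths:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("weekly-batch").getOrCreate()

    # Fully qualified URI, so the read goes to the EC2 HDFS rather than
    # the EMR cluster's own fs.defaultFS
    df = spark.read.parquet("hdfs://ec2-namenode-host:8020/data/input/")

    # ... transformations ...

    # Write results back to the persistent cluster (or to S3 / EMR-local HDFS)
    df.write.mode("overwrite").parquet("hdfs://ec2-namenode-host:8020/data/output/")

In either case the EMR cluster needs network reachability to the EC2 NameNode and DataNodes (same VPC or peered VPCs, with security groups allowing the HDFS ports), which is what the VPC/subnet choice at launch controls.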