3
votes

Reading/writing operations between EMR Spark clusters and Redshift can definitely be done via an intermediary data dump to S3.

There are Spark libraries, however, which can treat Redshift directly as a data source: https://github.com/databricks/spark-redshift

Do the EMR 5.0 Spark clusters come preconfigured with the library and access credentials for Redshift access?

2
See my updated answer. I just did all of this work fresh yesterday and decided to rewrite my answer to be more specific. - Kristian

2 Answers

4
votes

Do the EMR 5.0 Spark clusters come preconfigured with the library and access credentials for Redshift access?

No, EMR doesn't provide this library from Databricks.

To access Redshift: connectivity to Redshift doesn't require any IAM-based authentication. It simply requires the EMR cluster (the master/slave IPs, or the EMR master/slave security groups) to be whitelisted in Redshift's security group on its default port, 5439.

Now, since the Spark executors run Redshift COPY/UNLOAD commands on behalf of your Spark jobs, and those commands need access to S3, you need to configure the IAM credentials described here: https://github.com/databricks/spark-redshift#aws-credentials
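
For context, a write through the library looks roughly like this. This is only a sketch: the cluster endpoint, database, table, and bucket names are placeholders, and the S3 credential options are covered next.

// Sketch: write a DataFrame to Redshift; spark-redshift stages the rows under
// tempdir in S3 and issues a Redshift COPY. Placeholders for endpoint/db/table/bucket.
val df = spark.range(10).toDF("id")          // any DataFrame to be written

df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<cluster-endpoint>:5439/<database>?user=<user>&password=<password>")
  .option("dbtable", "<schema.table>")
  .option("tempdir", "s3n://<bucket>/tmp/")  // intermediate dump location in S3
  .mode("error")                             // fail if the table already exists
  .save()                                    // S3 credential options still required (see below)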

To access S3 from EMR:

By default, EMR nodes assume an IAM instance profile role called EMR_EC2_DefaultRole, and the permissions on this role define what the EMR nodes and the processes running on them (via InstanceProfileCredentialsProvider) have access to. So you can use the 4th way mentioned in that documentation. The access key, secret key, and session token can be retrieved as shown below and passed to the data source as the temporary_aws_access_key_id, temporary_aws_secret_access_key, and temporary_aws_session_token parameters:

https://github.com/databricks/spark-redshift#parameters

// Get temporary credentials from the EMR node's IAM instance profile
import com.amazonaws.auth.{AWSSessionCredentials, InstanceProfileCredentialsProvider}

val provider = new InstanceProfileCredentialsProvider()
val credentials: AWSSessionCredentials = provider.getCredentials.asInstanceOf[AWSSessionCredentials]
val token = credentials.getSessionToken
val awsAccessKey = credentials.getAWSAccessKeyId
val awsSecretKey = credentials.getAWSSecretKey
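
Those values can then be wired into the data source options. Again a sketch only; the endpoint, database, table, and bucket are placeholders, and the parameter names are the ones from the spark-redshift parameters page linked above.

// Sketch: pass the instance-profile credentials retrieved above to spark-redshift
// through its temporary_aws_* parameters.
val redshiftDF = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<cluster-endpoint>:5439/<database>?user=<user>&password=<password>")
  .option("dbtable", "<schema.table>")
  .option("tempdir", "s3n://<bucket>/tmp/")
  .option("temporary_aws_access_key_id", awsAccessKey)
  .option("temporary_aws_secret_access_key", awsSecretKey)
  .option("temporary_aws_session_token", token)
  .load()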

The EMR_EC2_DefaultRole should have read/write permissions on the S3 location used as tempdir.

Finally, EMR does include Redshift's JDBC drivers at /usr/share/aws/redshift/jdbc, which can be added to Spark's driver and executor classpaths (--jars).
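
For example (a sketch; the exact jar file name under that directory varies by EMR release, so check the directory first):

// Sketch: making the EMR-provided Redshift JDBC driver visible to the Spark
// driver and executors. On submission:
//   spark-shell --jars /usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar
// (assumed file name; list the directory to confirm), or programmatically,
// before the session is created:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-redshift-example")
  .config("spark.jars", "/usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar") // assumed file name
  .getOrCreate()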

2
votes

In order to allow access between EMR and any other AWS resource, you'll need to edit the roles (Identity and Access Management, aka "IAM") that are applied to the master/core nodes and add permissions to consume the services you need, i.e. S3 (already enabled by default), Redshift, etc.

As a side note, in some cases you can get away with using the AWS SDK in your applications to interface with those other services' APIs.
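
For instance, a minimal sketch of calling S3 directly through the AWS SDK for Java from Scala (the bucket name is a placeholder, and this assumes an SDK version recent enough to have AmazonS3ClientBuilder; credentials come from the instance profile):

// Sketch: listing an S3 bucket with the AWS SDK; credentials are resolved from
// the default provider chain, i.e. the EMR instance profile role.
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import scala.collection.JavaConverters._

val s3 = AmazonS3ClientBuilder.defaultClient()
s3.listObjectsV2("<bucket>").getObjectSummaries.asScala.foreach(obj => println(obj.getKey))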

There are some specific things you must do to get Spark to successfully talk to Redshift:

  1. Get the Redshift JDBC driver and include the jar in your Spark classpath with the --jars flag.

  2. Create a special role in IAM for Redshift: start by creating the role and choose the Redshift class/option at the beginning, so the primary resource is actually Redshift, and then add your additional permissions from there.

  3. Go into Redshift and attach that new role to your Redshift cluster.

  4. Provide the role's ARN in your Spark application (see the sketch after this list).

  5. Make sure S3 is granted permissions in that new role, because when Spark and Redshift talk to each other over JDBC, all the data is staged as an intermediate fileset in S3, like a temp swap file.

    Note: if you get permission errors about S3, try changing the protocol in the file path from s3:// to s3a://; the s3a:// scheme uses a different Hadoop filesystem implementation, which can resolve credentials differently. Source
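
Putting steps 1-5 together, the role-based path looks roughly like this. It is only a sketch, assuming a spark-redshift version that supports the aws_iam_role parameter; the endpoint, database, table, bucket, and role ARN are placeholders.

// Sketch of steps 1-5 wired together: the role ARN from step 4 goes into the
// aws_iam_role parameter, and tempdir (step 5) points at S3 via s3a://.
val df = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<cluster-endpoint>:5439/<database>?user=<user>&password=<password>")
  .option("dbtable", "<schema.table>")
  .option("aws_iam_role", "arn:aws:iam::<account-id>:role/<redshift-role>") // the role attached in step 3
  .option("tempdir", "s3a://<bucket>/tmp/")
  .load()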

After you do all of those things, Redshift and Spark can talk to each other. It's a lot of steps.