3
votes

Reading/writing operations between EMR Spark clusters and Redshift can definitely be done via an intermediary data dump to S3.

There are Spark libraries, however, which can treat Redshift directly as a data source: https://github.com/databricks/spark-redshift

Do the EMR 5.0 Spark clusters come preconfigured with the library and access credentials for Redshift access?

2
See my updated answer. I just did all of this work fresh yesterday and decided to rewrite my answer to be more specific. - Kristian

2 Answers

4
votes

Do the EMR 5.0 Spark clusters come preconfigured with the library and access credentials for Redshift access?

No, EMR doesn't provide this library from Databricks.

To access Redshift: connectivity to Redshift doesn't require any IAM-based authentication. It simply requires the EMR cluster (the master/slave IPs, or the EMR master/slave security groups) to be whitelisted in Redshift's security group on its default port, 5439.

Now, since the Spark executors run Redshift COPY/UNLOAD commands on behalf of your Spark jobs, and those commands need access to S3, you need to configure the IAM credentials described here: https://github.com/databricks/spark-redshift#aws-credentials
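
For context, a write through the library looks roughly like this. This is only a sketch: the cluster endpoint, database, table, and bucket names are placeholders, and the S3 credential options are covered next.

// Sketch: write a DataFrame to Redshift; spark-redshift stages the rows under
// tempdir in S3 and issues a Redshift COPY. Placeholders for endpoint/db/table/bucket.
val df = spark.range(10).toDF("id")          // any DataFrame to be written

df.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<cluster-endpoint>:5439/<database>?user=<user>&password=<password>")
  .option("dbtable", "<schema.table>")
  .option("tempdir", "s3n://<bucket>/tmp/")  // intermediate dump location in S3
  .mode("error")                             // fail if the table already exists
  .save()                                    // S3 credential options still required (see below)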

To access S3 from EMR:

By default, EMR nodes assume an IAM instance profile role called EMR_EC2_DefaultRole, and the permissions on this role define what the EMR nodes and the processes running on them (via InstanceProfileCredentialsProvider) have access to. So you can use the 4th way mentioned in that documentation. The access key, secret key, and session token can be retrieved as shown below and passed to the data source as the temporary_aws_access_key_id, temporary_aws_secret_access_key, and temporary_aws_session_token parameters:

https://github.com/databricks/spark-redshift#parameters

// Get temporary credentials from the EMR node's IAM instance profile
import com.amazonaws.auth.{AWSSessionCredentials, InstanceProfileCredentialsProvider}

val provider = new InstanceProfileCredentialsProvider()
val credentials: AWSSessionCredentials = provider.getCredentials.asInstanceOf[AWSSessionCredentials]
val token = credentials.getSessionToken
val awsAccessKey = credentials.getAWSAccessKeyId
val awsSecretKey = credentials.getAWSSecretKey
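
Those values can then be wired into the data source options. Again a sketch only; the endpoint, database, table, and bucket are placeholders, and the parameter names are the ones from the spark-redshift parameters page linked above.

// Sketch: pass the instance-profile credentials retrieved above to spark-redshift
// through its temporary_aws_* parameters.
val redshiftDF = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<cluster-endpoint>:5439/<database>?user=<user>&password=<password>")
  .option("dbtable", "<schema.table>")
  .option("tempdir", "s3n://<bucket>/tmp/")
  .option("temporary_aws_access_key_id", awsAccessKey)
  .option("temporary_aws_secret_access_key", awsSecretKey)
  .option("temporary_aws_session_token", token)
  .load()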

The EMR_EC2_DefaultRole should have read/write permissions on the S3 location used as tempdir.

Finally, EMR does include Redshift's JDBC drivers at /usr/share/aws/redshift/jdbc, which can be added to Spark's driver and executor classpaths (--jars).
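
For example (a sketch; the exact jar file name under that directory varies by EMR release, so check the directory first):

// Sketch: making the EMR-provided Redshift JDBC driver visible to the Spark
// driver and executors. On submission:
//   spark-shell --jars /usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar
// (assumed file name; list the directory to confirm), or programmatically,
// before the session is created:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-redshift-example")
  .config("spark.jars", "/usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar") // assumed file name
  .getOrCreate()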

2
votes

In order to allow access between EMR and any other AWS resource, you'll need to edit the roles (Identity and Access Management, aka "IAM") that are applied to the master/core nodes and add permissions to consume the services you need, i.e. S3 (already enabled by default), Redshift, etc.

As a side note, in some cases you can get away with using the AWS SDK in your applications to interface with those other services' APIs.
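
For instance, a minimal sketch of calling S3 directly through the AWS SDK for Java from Scala (the bucket name is a placeholder, and this assumes an SDK version recent enough to have AmazonS3ClientBuilder; credentials come from the instance profile):

// Sketch: listing an S3 bucket with the AWS SDK; credentials are resolved from
// the default provider chain, i.e. the EMR instance profile role.
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import scala.collection.JavaConverters._

val s3 = AmazonS3ClientBuilder.defaultClient()
s3.listObjectsV2("<bucket>").getObjectSummaries.asScala.foreach(obj => println(obj.getKey))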

There are some specific things you must do to get Spark to successfully talk to Redshift:

  1. Get the Redshift JDBC driver and include the jar in your Spark classpath with the --jars flag.

  2. Create a special role in IAM for Redshift: start by creating the role and choose the Redshift class/option at the beginning, so the primary resource is actually Redshift, and then add your additional permissions from there.

  3. Go into Redshift and attach that new role to your Redshift cluster.

  4. Provide the role's ARN in your Spark application (see the sketch after this list).

  5. Make sure S3 is granted permissions in that new role, because when Spark and Redshift talk to each other over JDBC, all the data is staged as an intermediate fileset in S3, like a temp swap file.

    Note: if you get permission errors about S3, try changing the protocol in the file path from s3:// to s3a://; the s3a:// scheme uses a different Hadoop filesystem implementation, which can resolve credentials differently. Source
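
Putting steps 1-5 together, the role-based path looks roughly like this. It is only a sketch, assuming a spark-redshift version that supports the aws_iam_role parameter; the endpoint, database, table, bucket, and role ARN are placeholders.

// Sketch of steps 1-5 wired together: the role ARN from step 4 goes into the
// aws_iam_role parameter, and tempdir (step 5) points at S3 via s3a://.
val df = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://<cluster-endpoint>:5439/<database>?user=<user>&password=<password>")
  .option("dbtable", "<schema.table>")
  .option("aws_iam_role", "arn:aws:iam::<account-id>:role/<redshift-role>") // the role attached in step 3
  .option("tempdir", "s3a://<bucket>/tmp/")
  .load()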

After you do all of those things, Redshift and Spark can talk to each other. It's a lot of steps.