Do the EMR 5.0 Spark clusters come preconfigured with the library and
access credentials for Redshift access?
No, EMR does not ship with Databricks' spark-redshift library.
To access Redshift:
Connectivity to Redshift does not require any IAM-based authentication. It simply requires the EMR cluster (the master/slave IPs, or the EMR master/slave security groups) to be whitelisted in Redshift's security group on its default port, 5439.
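If you want to script that whitelisting rather than do it in the console, a minimal sketch with the AWS Java SDK might look like the following. The security group IDs are placeholders, and it assumes the EMR cluster and the Redshift cluster are in the same VPC:

import com.amazonaws.services.ec2.AmazonEC2ClientBuilder
import com.amazonaws.services.ec2.model.{AuthorizeSecurityGroupIngressRequest, IpPermission, UserIdGroupPair}

// Placeholders: substitute your own Redshift and EMR security group IDs
val redshiftSgId = "sg-redshift-placeholder"
val emrSgId = "sg-emr-placeholder"

val ec2 = AmazonEC2ClientBuilder.defaultClient()

// Allow inbound TCP 5439 from the EMR security group into Redshift's security group
val permission = new IpPermission()
  .withIpProtocol("tcp")
  .withFromPort(5439)
  .withToPort(5439)
  .withUserIdGroupPairs(new UserIdGroupPair().withGroupId(emrSgId))

ec2.authorizeSecurityGroupIngress(
  new AuthorizeSecurityGroupIngressRequest()
    .withGroupId(redshiftSgId)
    .withIpPermissions(permission))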
Now, since spark-redshift issues COPY/UNLOAD commands under the hood and these commands require access to S3, you will need to configure IAM credentials as described here: https://github.com/databricks/spark-redshift#aws-credentials
To access S3 from EMR:
EMR nodes by default assume an IAM instance profile role called EMR_EC2_DefaultRole, and the permissions on this role define what the EMR nodes and the processes running on them (via InstanceProfileCredentialsProvider) have access to. So you may use the fourth option mentioned in the documentation. The access key, secret key, and session token can be retrieved as shown below, and passed as the options/parameters temporary_aws_access_key_id, temporary_aws_secret_access_key, and temporary_aws_session_token:
https://github.com/databricks/spark-redshift#parameters
// Get credentials from the IAM instance profile (EMR_EC2_DefaultRole)
import com.amazonaws.auth.{AWSSessionCredentials, InstanceProfileCredentialsProvider}

val provider = new InstanceProfileCredentialsProvider()
// Instance profile credentials are session credentials, so this cast is safe here
val credentials: AWSSessionCredentials = provider.getCredentials.asInstanceOf[AWSSessionCredentials]
val token = credentials.getSessionToken
val awsAccessKey = credentials.getAWSAccessKeyId
val awsSecretKey = credentials.getAWSSecretKey
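Putting it together, here is a sketch of a read through spark-redshift using those temporary credentials, assuming an existing SparkSession named spark on EMR 5.0 (Spark 2.0). The JDBC URL, table name, and tempdir bucket below are placeholders; the option names come from the parameters documentation linked above:

// Hypothetical values: substitute your own endpoint, table, and tempdir bucket
val df = spark.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://examplecluster.abc123.us-east-1.redshift.amazonaws.com:5439/mydb?user=myuser&password=mypassword")
  .option("dbtable", "my_table")
  .option("tempdir", "s3n://my-temp-bucket/spark-redshift/")
  .option("temporary_aws_access_key_id", awsAccessKey)
  .option("temporary_aws_secret_access_key", awsSecretKey)
  .option("temporary_aws_session_token", token)
  .load()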
The EMR_EC2_DefaultRole should have read/write permissions on the S3 location being used as tempdir.
Finally, EMR does include Redshift's JDBC drivers under /usr/share/aws/redshift/jdbc, which can be added to Spark's driver and executor classpaths (e.g. via --jars).
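For example, the driver jar can be put on the classpath when building the session. The exact jar file name under /usr/share/aws/redshift/jdbc varies by EMR release, so treat the one below as a placeholder:

import org.apache.spark.sql.SparkSession

// Placeholder jar name: check /usr/share/aws/redshift/jdbc on your cluster
val spark = SparkSession.builder()
  .appName("spark-redshift-example")
  .config("spark.jars", "/usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar")
  .getOrCreate()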