2 votes

I am trying to connect to Redshift and run simple queries from a Glue DevEndpoint (that is a requirement), but I cannot seem to connect.

The following code just times out:

df = spark.read \
  .format('jdbc') \
  .option("url", "jdbc:redshift://my-redshift-cluster.c512345.us-east-2.redshift.amazonaws.com:5439/dev?user=myuser&password=mypass") \
  .option("query", "select distinct(tablename) from pg_table_def where schemaname = 'public'; ") \
  .option("tempdir", "s3n://test") \
  .option("aws_iam_role", "arn:aws:iam::147912345678:role/my-glue-redshift-role") \
  .load()

What could be the reason?

I checked the URL, user, and password, and also tried different IAM roles, but every time it just hangs.

I also tried without an IAM role (just the URL, user/pass, and a schema/table that already exists there), and it also hangs/times out:

jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:redshift://my-redshift-cluster.c512345.us-east-2.redshift.amazonaws.com:5439/dev") \
    .option("dbtable", "public.test") \
    .option("user", "myuser") \
    .option("password", "mypass") \
    .load()

Reading data (directly in the Glue SSH terminal) from S3 or from Glue catalog tables works fine, so I know Spark and DataFrames are fine. Something is wrong with the connection to Redshift specifically, but I am not sure what.
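For reference, this is roughly the kind of catalog read that works on the DevEndpoint (a minimal sketch; the database and table names are placeholders, not values from my setup):

from awsglue.context import GlueContext

# Wrap the existing SparkContext in a GlueContext to read from the catalog.
glue_context = GlueContext(spark.sparkContext)

# Placeholder database/table names; reads like this succeed.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table")
dyf.toDF().show()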

Do you have a Glue connection for it? You may have to create a Glue connection, test it, and then add it to your Glue job. – Sandeep Fatangare
I have a connection, but how do you 'add it to a job'? – Joe
How about moving it to a private subnet? – Franxi Hidro

2 Answers

0 votes

Create a Glue job

Select the last option while creating the Glue job. On the next screen, it will ask you to select a Glue connection.
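If you prefer to attach the connection programmatically instead of through the console, here is a minimal sketch with boto3; the job name, role, script location, and connection name are all placeholders, not values from the question:

import boto3

glue = boto3.client("glue")

# Sketch only: Name, Role, ScriptLocation, and the connection name
# are assumed placeholders. The Connections parameter is what attaches
# the Glue connection to the job.
glue.create_job(
    Name="my-redshift-job",
    Role="my-glue-redshift-role",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/redshift_job.py",
    },
    Connections={"Connections": ["my-redshift-connection"]},
)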

0 votes

You seem to be on the right path. I connect to and query Redshift from a Glue PySpark job the same way, except for one minor change: using

.format("com.databricks.spark.redshift") 

I have also successfully used

.option("forward_spark_s3_credentials", "true")

instead of

.option("iam_role", "my_iam_role")