2 votes

I am trying to connect to Redshift and run simple queries from a Glue DevEndpoint (that is a requirement), but I cannot seem to connect.

The following code just times out:

df = spark.read \
  .format('jdbc') \
  .option("url", "jdbc:redshift://my-redshift-cluster.c512345.us-east-2.redshift.amazonaws.com:5439/dev?user=myuser&password=mypass") \
  .option("query", "select distinct(tablename) from pg_table_def where schemaname = 'public'; ") \
  .option("tempdir", "s3n://test") \
  .option("aws_iam_role", "arn:aws:iam::147912345678:role/my-glue-redshift-role") \
  .load()

What could be the reason?

I checked the URL, user, and password, and also tried different IAM roles, but every time it just hangs.

I also tried without an IAM role (just the URL, user/pass, and a schema/table that already exists there), and it also hangs/times out:

jdbcDF = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:redshift://my-redshift-cluster.c512345.us-east-2.redshift.amazonaws.com:5439/dev") \
    .option("dbtable", "public.test") \
    .option("user", "myuser") \
    .option("password", "mypass") \
    .load()

Reading data (directly in the Glue SSH terminal) from S3 or from Glue catalog tables works fine, so I know Spark and DataFrames are fine. Something is wrong with the connection to Redshift specifically, but I am not sure what.
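For reference, this is roughly the kind of catalog read that works on the DevEndpoint (a minimal sketch; the database and table names are placeholders, not values from my setup):

from awsglue.context import GlueContext

# Wrap the existing SparkContext in a GlueContext to read from the catalog.
glue_context = GlueContext(spark.sparkContext)

# Placeholder database/table names; reads like this succeed.
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="my_database",
    table_name="my_table")
dyf.toDF().show()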

Do you have a Glue connection for it? You may have to create a Glue connection, test it, and then add it to your Glue job. – Sandeep Fatangare
I have a connection, but how do you 'add it to a job'? – Joe
How about moving it to a private subnet? – Franxi Hidro

2 Answers

0 votes

Create a Glue job

Select the last option while creating the Glue job. On the next screen, it will ask you to select a Glue connection.
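If you prefer to attach the connection programmatically instead of through the console, here is a minimal sketch with boto3; the job name, role, script location, and connection name are all placeholders, not values from the question:

import boto3

glue = boto3.client("glue")

# Sketch only: Name, Role, ScriptLocation, and the connection name
# are assumed placeholders. The Connections parameter is what attaches
# the Glue connection to the job.
glue.create_job(
    Name="my-redshift-job",
    Role="my-glue-redshift-role",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/redshift_job.py",
    },
    Connections={"Connections": ["my-redshift-connection"]},
)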

0 votes

You seem to be on the right path. I connect to and query Redshift from a Glue PySpark job the same way, except for one minor change: using

.format("com.databricks.spark.redshift") 

I have also successfully used

.option("forward_spark_s3_credentials", "true")

instead of

.option("iam_role", "my_iam_role")