
I am trying to connect to Redshift from Spark (running on Databricks):

from pyspark.sql import SQLContext

# On Databricks, sc and sqlContext come predefined in the notebook
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", ACCESS_KEY)
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", SECRET_KEY)

# IP addresses from Redshift Security Group panel
IP_ADDRESSES_TO_ADD = ["1.2.3.4/32", "5.6.7.8/32"]
PORTS_TO_ADD = ["80", "443"]
PROTOCOLS_TO_ADD = ["tcp"]

# Read data from a query
df = sqlContext.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://XXX.XXX.eu-west-1.redshift.amazonaws.com:5439/REDSHIFT_DB?user=REDSHIFT_USER&password=REDSHIFT_PW&ssl=true&sslfactory=com.amazon.redshift.ssl.NonValidatingFactory") \
    .option("query", "select * FROM REDSHIFT_TABLE LIMIT 10") \
    .option("tempdir", "s3n://path/to/temp/") \
    .load()

However I am getting the following error:

java.sql.SQLException: [Amazon](500150) Error setting/closing connection: Connection timed out.

Am I missing something?

Check the ingress rules of the Redshift security group, and make sure it allows access from the IP address where you are running the script. - Vor

2 Answers


This looks like a network connectivity problem rather than a credentials issue. Verify that the machine running Spark can actually reach the Redshift endpoint.

To verify this, run:

telnet XXX.XXX.eu-west-1.redshift.amazonaws.com 5439

If the endpoint is reachable, you should see something like:

Trying <IP address>...
Connected to <Host name>.
Escape character is '^]'.

If the connection times out instead, your client's IP address is being blocked, most likely by the Redshift cluster's security group inbound rules.
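If telnet is not installed on the cluster nodes, the same reachability check can be sketched with Python's standard library (the hostname below is the placeholder from the question; substitute your own endpoint):

```python
import socket

def can_connect(host, port, timeout=5):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder endpoint from the question -- replace with your cluster's hostname:
# can_connect("XXX.XXX.eu-west-1.redshift.amazonaws.com", 5439)
```

A result of False here corresponds to the "Connection timed out" in the question and points at the security group rather than the JDBC URL.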


How are you spinning up the Databricks cluster nodes? Is it on demand? Every time the cluster terminates, you get a new set of IP addresses (EC2 instances) the next time it starts. So you need to make sure the newly assigned IP addresses are whitelisted in the Redshift security group's inbound rules.
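One way to automate that re-whitelisting is sketched below, assuming boto3 is available on the cluster and that the security group ID is the one attached to your Redshift cluster (the ID in the usage comment is hypothetical). The helper looks up the node's current public IP via AWS's checkip service and adds a /32 inbound rule for the Redshift port:

```python
import urllib.request

def cidr_for_ip(ip):
    """Redshift inbound rules take CIDR blocks; a single host is a /32."""
    return f"{ip}/32"

def whitelist_current_ip(security_group_id, port=5439):
    """Add an inbound rule for this machine's public IP (hypothetical helper)."""
    import boto3  # AWS SDK for Python; assumed installed on the cluster

    # checkip.amazonaws.com returns the caller's public IP as plain text
    my_ip = urllib.request.urlopen("https://checkip.amazonaws.com").read().decode().strip()
    ec2 = boto3.client("ec2")
    ec2.authorize_security_group_ingress(
        GroupId=security_group_id,
        IpProtocol="tcp",
        FromPort=port,
        ToPort=port,
        CidrIp=cidr_for_ip(my_ip),
    )

# Example (the group ID is an assumption -- use your cluster's own):
# whitelist_current_ip("sg-0123456789abcdef0")
```

Running something like this at cluster startup keeps the rules in sync with the freshly assigned node IPs; note it adds a new rule each time, so stale /32 entries should be pruned periodically.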