
I need to read Redshift data into dataframes in Zeppelin. For the last several months I've been using Spark 2.0 via Zeppelin on AWS to open csv and json S3 files successfully.

I used to be able to connect to Redshift from Zeppelin on AWS EMR with Spark 1.6.2 (maybe 1.6.1), using this code:

%pyspark

from pyspark.sql import SQLContext, Row
import sys
from pyspark.sql.window import Window
import pyspark.sql.functions as func

#Load the data
aquery = "(SELECT serial_number, min(date_time) min_date_time from schema.table where serial_number in ('abcdefg','1234567') group by serial_number) as minDates"

dfMinDates = sqlContext.read.format('jdbc').options(url='jdbc:postgresql://dadadadaaaredshift.amazonaws.com:5439/idw?tcpKeepAlive=true&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory?user=user&password=password', dbtable=aquery).load()
dfMinDates.show()

and it worked. That was summer of 2016.

I haven't needed it since then, and AWS now offers Spark 2.0.

The new syntax is myDF = spark.read.jdbc(...), like this:

%pyspark

aquery = "(SELECT serial_number, min(date_time) min_date_time from schema.table where serial_number in ('abcdefg','1234567') group by serial_number) as minDates"

dfMinDates = spark.read.jdbc("jdbc:postgresql://dadadadaaaredshift.amazonaws.com:5439/idw?tcpKeepAlive=true&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory?user=user&password=password", table=aquery)
dfMinDates.show()

but I get this error:

Py4JJavaError: An error occurred while calling o119.jdbc.
: java.sql.SQLException: No suitable driver
    at java.sql.DriverManager.getDriver(DriverManager.java:315)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:54)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$2.apply(JdbcUtils.scala:54)
    at scala.Option.getOrElse(Option.scala:121)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createConnectionFactory(JdbcUtils.scala:53)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRDD$.resolveTable(JDBCRDD.scala:123)
    at org.apache.spark.sql.execution.datasources.jdbc.JDBCRelation.<init>(JDBCRelation.scala:117)
    at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:237)
    at org.apache.spark.sql.DataFrameReader.jdbc(DataFrameReader.scala:159)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:211)
    at java.lang.Thread.run(Thread.java:745)

I researched the Spark 2.0 documentation, and found this:

The JDBC driver class must be visible to the primordial class loader on the client session and on all executors. This is because Java’s DriverManager class does a security check that results in it ignoring all drivers not visible to the primordial class loader when one goes to open a connection. One convenient way to do this is to modify compute_classpath.sh on all worker nodes to include your driver JARs.

I wasn't sure how to implement this, so I did more reading (various blog posts and Stack Overflow answers) and found this:

spark.driver.extraClassPath = org.postgresql.Driver

I did this in the Interpreter settings page of Zeppelin, but still I get the same error.
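
In case it helps with diagnosis, this is how I currently understand the pieces are supposed to fit together in Spark 2.0 (the jar itself on the classpath, and the driver class name passed as a connection property rather than as the extraClassPath value); I may well have this wrong:

%pyspark

# My (possibly wrong) understanding: spark.driver.extraClassPath should point at the
# driver jar file on disk, while the driver class name goes into the JDBC properties.
aquery = "(SELECT serial_number, min(date_time) min_date_time from schema.table where serial_number in ('abcdefg','1234567') group by serial_number) as minDates"

props = {
    "user": "user",
    "password": "password",
    "driver": "org.postgresql.Driver",
}

dfMinDates = spark.read.jdbc(
    "jdbc:postgresql://dadadadaaaredshift.amazonaws.com:5439/idw?tcpKeepAlive=true&ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory",
    table=aquery,
    properties=props)
dfMinDates.show()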

I tried adding a Postgres interpreter, but I'm not sure I did it correctly, because I wasn't sure whether to attach it to the Spark interpreter or the Python interpreter; I chose the Spark interpreter. The Postgres interpreter now has all the same settings as the Spark interpreter, which may not matter, but I still get the same error.

In Spark 1.6, I just don't remember going through all this trouble.

As an experiment, I spun up an EMR cluster with Spark 1.6.2 and tried the old code that used to work, and got the same error as above!

The Zeppelin site covers Postgres, but the information there reads like code samples rather than instructions for setting up the interpreters, so I don't know how to apply it.

I'm out of ideas and references.

Any suggestions are much appreciated!


1 Answer


You need to use Amazon's Redshift-specific driver. You can download it from here: http://docs.aws.amazon.com/redshift/latest/mgmt/configure-jdbc-connection.html.

However, if you're using EMR it's already in place (at /usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar) and you can just tell Zeppelin where it is.
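
For example, one way to make that jar visible to the Spark interpreter (a sketch; depending on your Zeppelin version you can also add it as a dependency in the Spark interpreter settings) is the %dep interpreter, run in a paragraph before the Spark interpreter has started:

%dep

// Load the Redshift JDBC driver that EMR ships (path from above).
// If the Spark interpreter is already running, restart it first, since
// %dep only takes effect before the SparkContext is created.
z.reset()
z.load("/usr/share/aws/redshift/jdbc/RedshiftJDBC41.jar")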

Here's how to declare it: AWS Redshift driver in Zeppelin
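
Once the driver jar is visible, a read along these lines should work in Spark 2.0. This is only a sketch: the cluster endpoint, database, query, and credentials are placeholders, and the driver class name is my assumption for the JDBC 4.1 jar mentioned above:

%pyspark

# Same kind of pushed-down subquery as in the question (placeholder schema/table).
aquery = "(SELECT serial_number, min(date_time) min_date_time from schema.table group by serial_number) as minDates"

props = {
    "user": "user",          # placeholder credentials
    "password": "password",
    # Assumed class name inside RedshiftJDBC41.jar
    "driver": "com.amazon.redshift.jdbc41.Driver",
}

# Note the jdbc:redshift:// prefix used by Amazon's driver (placeholder endpoint).
dfMinDates = spark.read.jdbc(
    "jdbc:redshift://your-cluster.redshift.amazonaws.com:5439/idw?tcpKeepAlive=true&ssl=true",
    table=aquery,
    properties=props)
dfMinDates.show()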