
I have an Amazon EMR cluster running. If I do

ls -l /usr/share/aws/redshift/jdbc/

it gives me

RedshiftJDBC41-1.2.7.1003.jar
RedshiftJDBC42-1.2.7.1003.jar

Now, I want to use this jar to connect to my Redshift database in my spark-shell. Here is what I do:

import org.apache.spark.sql._
val sqlContext = new SQLContext(sc)


val df : DataFrame = sqlContext.read
.option("url","jdbc:redshift://host:PORT/DB-name?user=user&password=password")
.option("dbtable","tablename")
.load()

and I get this error -

org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;

I am not sure if I am specifying the correct format while reading the data. I have also read that the spark-redshift driver is available, but I do not want to run spark-submit with extra JARs.

How do I connect to Redshift data from spark-shell? Is that the correct JAR for configuring the connection in Spark?


1 Answer


The error is generated because you are missing .format("jdbc") in your read. It should be:

val df : DataFrame = sqlContext.read
  .format("jdbc")
  .option("url","jdbc:redshift://host:PORT/DB-name?user=user&password=password")
  .option("dbtable","tablename")
  .load()

By default, Spark assumes sources to be Parquet files, hence the mention of Parquet in the error.

You may still run into classpath issues or trouble finding the drivers, but this change should give you more useful error output. I assume the folder location you listed is on the classpath for Spark on EMR, and those driver versions look fairly current, so they should work.
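If the driver does not get picked up automatically, a minimal sketch of how to test it is to pass the jar to spark-shell yourself and name the driver class explicitly. The class name below (com.amazon.redshift.jdbc42.Driver) is the one shipped in the JDBC42 jar; verify it against the jar version you actually have:

spark-shell --jars /usr/share/aws/redshift/jdbc/RedshiftJDBC42-1.2.7.1003.jar

// inside the shell: the same read as above, with the driver class spelled out
val df = sqlContext.read
  .format("jdbc")
  .option("driver", "com.amazon.redshift.jdbc42.Driver")
  .option("url", "jdbc:redshift://host:PORT/DB-name?user=user&password=password")
  .option("dbtable", "tablename")
  .load()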

Note that this will only work for reading from Redshift. If you need to write to Redshift, your best bet is the Databricks Redshift data source for Spark: https://github.com/databricks/spark-redshift.
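For completeness, a rough sketch of reading and writing through that data source looks like the following. It stages data through S3, so a tempdir option is required; the S3 path and target table name below are placeholders, and you still need to provide S3 credentials in whatever way your cluster is set up for:

// read via the Databricks Redshift data source
val redshiftDF = sqlContext.read
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://host:PORT/DB-name?user=user&password=password")
  .option("dbtable", "tablename")
  .option("tempdir", "s3n://your-bucket/tmp")   // placeholder S3 staging path
  .load()

// write back to Redshift
redshiftDF.write
  .format("com.databricks.spark.redshift")
  .option("url", "jdbc:redshift://host:PORT/DB-name?user=user&password=password")
  .option("dbtable", "target_table")            // placeholder target table
  .option("tempdir", "s3n://your-bucket/tmp")
  .mode("error")
  .save()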