I have an Amazon EMR cluster running. If I do
ls -l /usr/share/aws/redshift/jdbc/
it gives me
RedshiftJDBC41-1.2.7.1003.jar
RedshiftJDBC42-1.2.7.1003.jar
Now, I want to use this jar
to connect to my Redshift database
in my spark-shell
. Here is what I do -
import org.apache.spark.sql._
val sqlContext = new SQLContext(sc)
val df : DataFrame = sqlContext.read
.option("url","jdbc:redshift://host:PORT/DB-name?user=user&password=password")
.option("dbtable","tablename")
.load()
and I get this error -
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
I am not sure if I am specifying the correct format
while reading the data. I have also read that spark-redshift driver
is available but I do not want to run spark-submit
with extra JARS
.
How do I connect to redshift data from Spark-shell ? Is that the correct JAR to configure the connection in Spark ?