I have an Amazon EMR cluster running. If I do
ls -l /usr/share/aws/redshift/jdbc/
it gives me
RedshiftJDBC41-1.2.7.1003.jar
RedshiftJDBC42-1.2.7.1003.jar
Now, I want to use this jar to connect to my Redshift database in my spark-shell . Here is what I do -
import org.apache.spark.sql._
val sqlContext = new SQLContext(sc)
val df : DataFrame = sqlContext.read
.option("url","jdbc:redshift://host:PORT/DB-name?user=user&password=password")
.option("dbtable","tablename")
.load()
and I get this error -
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
I am not sure if I am specifying the correct format while reading the data. I have also read that spark-redshift driver is available but I do not want to run spark-submit with extra JARS.
How do I connect to redshift data from Spark-shell ? Is that the correct JAR to configure the connection in Spark ?