2 votes

I am trying to read a table from PostgreSQL 9.6 into an RDD in Spark 2.1.1, for which I have the following Scala code.

import org.apache.spark.rdd.JdbcRDD
import java.sql.DriverManager
import org.apache.spark.SparkContext

val sc = SparkContext.getOrCreate()

val rdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection(
    "jdbc:postgresql://my_host:5432/my_db", "my_user", "my_pass"),
  sql = "select * from my_table",
  lowerBound = 0,
  upperBound = 100000,
  numPartitions = 2)

However, it is returning the following error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 7, 10.0.0.13, executor 1): org.postgresql.util.PSQLException: The column index is out of range: 1, number of columns: 0.

I am using the latest PostgreSQL JDBC driver, and I have checked that it authenticates correctly against the database.

Any ideas why this might be happening or any alternatives I can try?


1 Answer

3 votes

From the Spark documentation:

The query must contain two ? placeholders for parameters used to partition the results

and

lowerBound is the minimum value of the first placeholder; upperBound is the maximum value of the second placeholder

So your query should look more like:

select * from my_table where ? <= id and id <= ?
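Keeping everything else from your snippet the same, a minimal sketch of the corrected call would be the following (it assumes, as in the documentation's example, that my_table has a numeric id column to partition on):

import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD
import org.apache.spark.SparkContext

val sc = SparkContext.getOrCreate()

val rdd = new JdbcRDD(
  sc,
  () => DriverManager.getConnection(
    "jdbc:postgresql://my_host:5432/my_db", "my_user", "my_pass"),
  sql = "select * from my_table where ? <= id and id <= ?",
  lowerBound = 0,       // bound into the first ? placeholder
  upperBound = 100000,  // bound into the second ? placeholder
  numPartitions = 2)    // the [lowerBound, upperBound] range is split across the partitions

rdd.count()  // forces evaluation; each partition runs the query over its own slice of the range

JdbcRDD binds lowerBound and upperBound into those placeholders for every partition, which is why the driver complained about parameter index 1 when your query contained no placeholders at all.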