2 votes

I'm trying to connect to Oracle DB from Spark SQL with following code:

    import org.apache.spark.storage.StorageLevel

    val dataTarget = sqlContext.read
      .format("jdbc")
      .option("driver", config.getString("oracledriver"))
      .option("url", config.getString("jdbcUrl"))
      .option("user", config.getString("usernameDH"))
      .option("password", config.getString("passwordDH"))
      .option("dbtable", targetQuery)
      .option("partitionColumn", "ID")
      .option("lowerBound", "5")
      .option("upperBound", "499999")
      .option("numPartitions", "10")
      .load()
      .persist(StorageLevel.DISK_ONLY)

By default, when Spark SQL connects to Oracle it creates a single connection, so the entire table is read into one partition. That loses all parallelism and causes performance problems when the table is huge. In my code I pass option("numPartitions", "10"), which should create 10 connections. Please correct me if I'm wrong, but as far as I know the number of connections to Oracle equals the number of partitions we request.
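If I understand correctly, this range-based partitioning amounts to issuing one bounded query per partition. For illustration, a sketch of the explicit form using the predicates overload of read.jdbc (the split below only approximates Spark's internal logic; all names are reused from my code above):

    import java.util.Properties

    // Roughly how Spark splits [lowerBound, upperBound] into numPartitions
    // strides; each predicate becomes one partition, hence one connection.
    val lower = 5L
    val upper = 499999L
    val numPartitions = 10
    val stride = (upper - lower) / numPartitions

    val predicates = (0 until numPartitions).map { i =>
      val start = lower + i * stride
      val end   = lower + (i + 1) * stride
      if (i == 0) s"ID < $end OR ID IS NULL"
      else if (i == numPartitions - 1) s"ID >= $start"
      else s"ID >= $start AND ID < $end"
    }.toArray

    val connProps = new Properties()
    connProps.setProperty("driver", config.getString("oracledriver"))
    connProps.setProperty("user", config.getString("usernameDH"))
    connProps.setProperty("password", config.getString("passwordDH"))

    // One Oracle session per predicate, same as numPartitions = 10 above.
    val dataTargetExplicit = sqlContext.read
      .jdbc(config.getString("jdbcUrl"), targetQuery, predicates, connProps)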

I get the error below when I use more connections, presumably because there is a session limit on the Oracle side:

    java.sql.SQLException: ORA-02391: exceeded simultaneous SESSIONS_PER_USER limit

So if I request more partitions for parallelism, I hit this error, but if I request fewer I face performance problems. Is there any other way to create a single connection and load the data into multiple partitions? (This would save my life.)

Please suggest.


2 Answers

1 vote

Is there any other way to create a single connection and load data into multiple partitions

There is not. In general, partitions are processed by different physical nodes and different virtual machines, and given all the authorization and authentication mechanisms involved, you cannot simply take a connection and pass it around from node to node.

If the problem is just exceeding SESSIONS_PER_USER, contact the DBA and ask them to increase that value for the Spark user.

If the problem is throttling, you can try to keep the same number of partitions but decrease the number of cores the Spark application uses (for example via --executor-cores or fewer executors), so fewer partitions are read concurrently. Since this is mostly micromanagement, it might be better to drop JDBC completely, use a standard export mechanism (for Oracle, e.g. SQL*Plus spooling to CSV or Data Pump), and read the exported files directly.
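If you go the export route, the Spark side reduces to a plain file read. A minimal sketch, assuming a hypothetical export directory of CSV files and Spark 2.x's built-in CSV source:

    // Hypothetical path: files produced by the database export, not by Spark.
    val exported = sqlContext.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/exports/TARGET_TABLE/*.csv")

    // File splits become partitions automatically; no Oracle sessions are used.
    println(exported.rdd.getNumPartitions)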

0 votes

One workaround might be to load the data using a single Oracle connection (a single partition) and then simply repartition:

    val dataTargetPartitioned = dataTarget.repartition(100)

You can also repartition by a column (when working with a DataFrame). Note that repartition takes Column arguments, so the column name must be wrapped, e.g. with org.apache.spark.sql.functions.col:

    import org.apache.spark.sql.functions.col

    val dataTargetPartitioned = dataTarget.repartition(100, col("MY_COL"))