
I am connecting to an Oracle database over JDBC from Spark and trying to read an Oracle table containing 40 million rows. I launch spark-shell/spark-submit with 30 executors, 5 executor cores, and 4g of memory per executor. When I count the DataFrame or try to write its data, only one executor is used to read/write the data from Oracle. I tried repartitioning the DataFrame, but it still uses only one executor, causing huge performance degradation.

Below is the syntax used; any suggestion is highly appreciated.

Command snippet:

spark-shell --executor-memory 4G --executor-cores 5 --num-executors 30

val source_df = spark.read.format("jdbc")
  .option("url", JDBC_URL)
  .option("dbtable", src_table)
  .option("user", "*****")
  .option("password", "*****")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("numPartitions", 40)
  .option("partitionColumn", "*****")
  .option("lowerBound", 1)
  .option("upperBound", 100000)
  .load()

val df_1_msag = source_df.repartition(40)
df_1_msag.count

[Stage 0:=======================================================> (39 + 1
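To see how the rows actually land across the JDBC partitions, here is a minimal diagnostic sketch using Spark's built-in spark_partition_id function (a heavily skewed count here would explain one long-running task):

import org.apache.spark.sql.functions.spark_partition_id

// Rows per JDBC partition, straight off the load (before any repartition).
// If one partition holds nearly all 40 million rows, a single task -- and
// hence a single executor core -- ends up doing almost all of the read.
source_df.groupBy(spark_partition_id().alias("partition_id")).count().show(40, false)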
I'm guessing Stage 0 is doing a select count(*) to determine how to divide your range into partitions. That will be done by a single task. – mazaneicha
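Related to the range division the comment mentions: lowerBound and upperBound only set the partition stride, they do not filter rows, so with upperBound = 100000 on a 40-million-row table the last partition likely receives every row above the bound. A hedged sketch of deriving the real bounds first (the column name ID and the MIN/MAX subquery are illustrative, not the poster's code; a numeric, roughly uniform partition column is assumed):

// Sketch: fetch the real MIN/MAX of the partition column first, so the
// stride spans the whole table. "ID" is a placeholder for the real column.
val bounds = spark.read.format("jdbc")
  .option("url", JDBC_URL)
  .option("dbtable", s"(SELECT MIN(ID) lo, MAX(ID) hi FROM $src_table) b")
  .option("user", "*****")
  .option("password", "*****")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .load()
  .first()

val lo = bounds.get(0).toString
val hi = bounds.get(1).toString

val source_df = spark.read.format("jdbc")
  .option("url", JDBC_URL)
  .option("dbtable", src_table)
  .option("user", "*****")
  .option("password", "*****")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("numPartitions", 40)
  .option("partitionColumn", "ID")
  .option("lowerBound", lo)
  .option("upperBound", hi)
  .load()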

1 Answer


The number of concurrent connections allowed for the user in the Oracle database is also important: each JDBC partition opens its own connection, so numPartitions cannot usefully exceed that limit.

https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html

numPartitions
The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing.
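As a hedged illustration of that last sentence (the limit of 10 below is an assumed session quota, and TARGET_TABLE is a hypothetical table; the real limit would come from your DBA or the Oracle profile):

// Sketch: if the account were limited to 10 concurrent sessions, writing with
// numPartitions = 10 makes Spark coalesce a wider DataFrame down to 10
// partitions before opening the JDBC connections.
source_df.write
  .format("jdbc")
  .option("url", JDBC_URL)
  .option("dbtable", "TARGET_TABLE")   // hypothetical target table
  .option("user", "*****")
  .option("password", "*****")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("numPartitions", 10)         // assumed session limit
  .mode("append")
  .save()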