I am connecting to an Oracle database over JDBC using Spark and trying to read an Oracle table containing 40 million rows. I launch spark-shell/spark-submit with 30 executors, 5 executor cores, and 4g of memory per executor. When counting the DataFrame or writing its data out, Spark uses only one executor to read/write the data from Oracle. I tried repartitioning the DataFrame, but it still uses only 1 executor, causing huge performance degradation.
Below is the syntax used; any suggestion is highly appreciated.
Command snippet:
spark-shell --executor-memory 4G --executor-cores 5 --num-executors 30
val source_df = spark.read.format("jdbc")
  .option("url", JDBC_URL)
  .option("dbtable", src_table)
  .option("user", *****)
  .option("password", *****)
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("numPartitions", 40)
  .option("partitionColumn", "*****")
  .option("lowerBound", 1)
  .option("upperBound", 100000)
  .load()
val df_1_msag = source_df.repartition(40)
df_1_msag.count
[Stage 0:=======================================================> (39 + 1
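The progress bar shows 39 tasks finishing quickly and one task holding up the stage. One way to see where the rows actually land is to count rows per partition (a diagnostic sketch, not part of the original code; it assumes the source_df read above and uses the standard spark_partition_id function; note it triggers a full scan of the table):

import org.apache.spark.sql.functions.spark_partition_id

// The DataFrame reports 40 partitions...
source_df.rdd.getNumPartitions

// ...but the per-partition row counts reveal the skew: if upperBound is far
// below the partition column's real maximum, the last partition's open-ended
// range predicate captures nearly all 40 million rows.
source_df.groupBy(spark_partition_id()).count().orderBy("count").show(40)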
Stage 0 is doing a select count(*) to determine how to divide your range into partitions. That will be done by a single task. – mazaneicha
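If the partition column's real values run well past the configured upperBound of 100000, Spark's generated range predicates push almost everything beyond that bound into the last partition, which would match a single long-running task. Below is a minimal sketch of one common fix, not the poster's code: query the column's real MIN/MAX once and use those as the bounds. It assumes a numeric partition column (ID is a hypothetical stand-in for the masked column name), reuses JDBC_URL and src_table from above, and keeps the credentials masked.

// Derive real bounds with a one-row query before the main read.
// "ID" is a placeholder for the masked partitionColumn.
val bounds = spark.read.format("jdbc")
  .option("url", JDBC_URL)
  .option("dbtable", s"(SELECT MIN(ID) AS lo, MAX(ID) AS hi FROM $src_table) t")
  .option("user", "*****")
  .option("password", "*****")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .load()
  .head()

// Oracle NUMBER columns typically surface as DecimalType in Spark.
val lower = bounds.getDecimal(0).longValue
val upper = bounds.getDecimal(1).longValue

// Re-read with bounds that span the whole table; Spark then issues 40
// parallel range queries of roughly equal size, one per partition.
val source_df = spark.read.format("jdbc")
  .option("url", JDBC_URL)
  .option("dbtable", src_table)
  .option("user", "*****")
  .option("password", "*****")
  .option("driver", "oracle.jdbc.driver.OracleDriver")
  .option("numPartitions", 40)
  .option("partitionColumn", "ID")
  .option("lowerBound", lower)
  .option("upperBound", upper)
  .load()

Note that lowerBound and upperBound never filter rows; they only control how the partition predicates are generated. With bounds matching the data, the repartition(40) step also becomes unnecessary for read parallelism: the JDBC source already produces 40 partitions, and repartition only reshuffles data after the scan has completed.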