
Databricks documentation mentions that we have to provide 'partitionColumn, lowerBound, upperBound and numPartitions' values while reading data from a relational database across multiple workers. It also says that partitionColumn should be an integer column. How can I then make parallel reads from a table that doesn't have an integer column?

Databricks Documentation

I tried using 'rownum' (the source is an Oracle DB) as the partition column, but when I try to store the results in a dataframe, I only get the data from the first partition.

Here is the code:

jdbcUrl = "jdbc:oracle:thin:@//{0}:{1}/{2}".format(hostname, port, db)

connectionProperties = {
  "user" : 'XXXXXX',
  "password" : 'XXXXXX',
  "driver" : "oracle.jdbc.driver.OracleDriver",
  "oracle.jdbc.timezoneAsRegion" : "false"
}

parallel_df = spark.read.jdbc(url=jdbcUrl,
                          table=table_name, 
                          column='rownum', 
                          lowerBound=1, 
                          upperBound=200000, 
                          numPartitions=20,
                          properties=connectionProperties)

When I get the count of parallel_df, I only get 200000/20 = 10000 rows. Can anyone provide any insight into how to do this parallel read?

You can use predicates, but the exact usage will depend on the data. - Alper t. Turker
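As that comment suggests, spark.read.jdbc also accepts a predicates argument: a list of WHERE-clause expressions, each of which defines one partition, so no integer column is required at all. A minimal sketch, assuming the rows can be bucketed by hashing their Oracle ROWIDs (the bucket count of 8 is arbitrary):

# Sketch: split the read into 8 partitions by hashing each row's ROWID.
# ORA_HASH(ROWID, 7) maps every row to the same bucket 0..7 in every query,
# unlike ROWNUM, so the partitions neither overlap nor miss rows.
predicates = ["ORA_HASH(ROWID, 7) = {}".format(b) for b in range(8)]

parallel_df = spark.read.jdbc(url=jdbcUrl,
                              table=table_name,
                              predicates=predicates,
                              properties=connectionProperties)

Each element of the list becomes one partition, so eight clauses yield eight parallel reads.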

1 Answer


When I get the count of parallel_df, I only get 200000/20 = 10000 rows. Can anyone provide any insight into how to do this parallel read?

When you read a dataframe that way (i.e. with partitions), Spark issues a separate query for each partition; in your case something like:

select t.* from table t where rownum between 1 and 10000     -- on executor 1
select t.* from table t where rownum between 10001 and 20000 -- on executor 2

and so on ...

From Oracle's point of view these queries are unrelated, and rownum (as always) starts from 1 in each of them, so you only get the 10000 rows returned by the first query.
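If you still want a range-partitioned read, one workaround (a sketch; my_table is a placeholder for your table name, and the bucket count of 20 is arbitrary) is to derive a stable integer column yourself inside a subquery and partition on that instead:

# Sketch: ora_hash(rowid, 19) assigns each row a fixed bucket 0..19, so the
# derived part_col behaves like a real integer column across all 20
# per-partition queries, unlike ROWNUM.
dbtable = "(select t.*, ora_hash(t.rowid, 19) as part_col from my_table t) q"

parallel_df = spark.read.jdbc(url=jdbcUrl,
                              table=dbtable,
                              column='part_col',
                              lowerBound=0,
                              upperBound=20,
                              numPartitions=20,
                              properties=connectionProperties)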

Databricks documentation mentions that we have to provide 'partitionColumn, lowerBound, upperBound and numPartitions' values while reading data from a relational database across multiple workers.

True, but you don't have to use it at all costs. If your data doesn't have a column suitable for partitioning/splitting, then just don't use that feature:

df = spark.read.jdbc(url=jdbcUrl, table=table_name, properties=connectionProperties)
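The read itself then happens over a single connection, but you can still repartition afterwards so that downstream transformations run in parallel (a sketch; 20 is an arbitrary partition count):

# Sketch: the plain JDBC read above produces a single partition; spread the
# rows across 20 partitions so later stages can use the whole cluster.
df = df.repartition(20)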