In Spark, a pushdown query is processed by the database's SQL engine, and the DataFrame is constructed from its result; Spark then queries the results of that query.
val pushdownQuery = """(SELECT DISTINCT(FLIGHT_NUMBER) blah blah blah) pq"""  // a dbtable subquery needs an alias on most databases
val dbDataFrame = spark.read.format("jdbc")
  .option("url", url)
  .option("dbtable", pushdownQuery)
  .option("driver", driver)
  .option("numPartitions", 4)
  .option("partitionColumn", "COUNTRY_CODE")
  .option("lowerBound", 0)    // lowerBound/upperBound are required
  .option("upperBound", 999)  // whenever partitionColumn is set
  .load()
I can see from another reference (https://dzone.com/articles/how-apache-spark-makes-your-slow-mysql-queries-10x) that in Spark + MySQL, parallelism for a pushdown query is achieved by firing multiple queries based on the numPartitions and partitionColumn arguments. This is very similar to how Sqoop distributes its work. Say, for the example above, numPartitions = 4, partitionColumn = COUNTRY_CODE, and the COUNTRY_CODE values in our table fall in the range (000, 999).
Four queries are built and fired at the DB, and the DataFrame is constructed from their combined results (with a parallelism of 4 in this case), roughly like this:
Q1 : SELECT DISTINCT(FLIGHT_NUMBER) blah blah blah WHERE COUNTRY_CODE >= 000 AND COUNTRY_CODE <= 250
Q2 : SELECT DISTINCT(FLIGHT_NUMBER) blah blah blah WHERE COUNTRY_CODE > 250 AND COUNTRY_CODE <= 500
Q3 : SELECT DISTINCT(FLIGHT_NUMBER) blah blah blah WHERE COUNTRY_CODE > 500 AND COUNTRY_CODE <= 750
Q4 : SELECT DISTINCT(FLIGHT_NUMBER) blah blah blah WHERE COUNTRY_CODE > 750 AND COUNTRY_CODE <= 999
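The stride logic behind those four queries can be sketched in plain Scala. This is a simplified model of what Spark's JDBC source does internally (loosely based on JDBCRelation's column partitioning), not the exact implementation; the object and method names here are my own. Note that in real Spark the first partition is open-ended below (and catches NULLs) and the last is open-ended above, so rows outside [lowerBound, upperBound] are not dropped:

```scala
// Sketch of how a partitionColumn is split into per-partition WHERE
// predicates, given lowerBound, upperBound and numPartitions.
object PartitionSketch {
  def predicates(col: String, lower: Long, upper: Long, numPartitions: Int): Seq[String] = {
    val stride = (upper - lower) / numPartitions  // width of each range
    (0 until numPartitions).map { i =>
      val lo = lower + i * stride
      val hi = lo + stride
      if (i == 0)
        s"$col < $hi OR $col IS NULL"             // first partition: open below, catches NULLs
      else if (i == numPartitions - 1)
        s"$col >= $lo"                            // last partition: open above
      else
        s"$col >= $lo AND $col < $hi"             // middle partitions: half-open ranges
    }
  }

  def main(args: Array[String]): Unit =
    // For the example in the question: COUNTRY_CODE in (000, 999), 4 partitions
    predicates("COUNTRY_CODE", 0L, 999L, 4).foreach(println)
}
```

Each predicate is appended to the pushdown query and sent to the database as a separate task, which is where the parallelism of 4 comes from.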
The question I have now is: how can parallelism be achieved with this method in Spark (version 2.1) + HBase (query engine: Big SQL)? It's not giving me any parallelism right now. Do the drivers that bridge Spark and HBase need an update? Or does Spark need to change? What kind of change would help here? Any direction would be appreciated. Thank you!