So I have a really huge table containing billions of rows. I tried to load data with the Spark DataFrame API; here is my code:

sql = "select * from mytable where day = 2016-11-25 and hour = 10"
df = sqlContext.read \
    .format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", table) \
    .load(sql)

df.show()

I ran the SQL directly in MySQL and it returns about 100 rows, but the code above does not work in Spark SQL: it hits an OOM error. It seems that Spark SQL loads all the data into memory without applying the WHERE clause. How can I get Spark SQL to use the WHERE clause?

Comment from Shiv4nsh (2 votes): try running this:

df = sqlContext.read \
    .format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", table) \
    .load("select * from mytable") \
    .where($"day" === "2016-11-25" and $"hour" === 10)
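Note that the comment mixes Scala Column syntax ($"day" === ...) into otherwise Python code, and .load() does not take a SQL string. A hedged PySpark rendering of the same idea, assuming the sqlContext and connection variables from the question, might look like this (whether the filter is actually pushed down to MySQL depends on the JDBC source's predicate pushdown; the subquery approach in the first answer below is the reliable fix):

from pyspark.sql.functions import col

# Sketch only: load the table via "dbtable", then filter with Column
# expressions. Note PySpark uses `&`, not `and`, to combine conditions.
df = sqlContext.read \
    .format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", table) \
    .load() \
    .where((col("day") == "2016-11-25") & (col("hour") == 10))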

2 Answers

Answer 1 (3 votes)

I have solved the problem. The Spark doc gives the answer (see the JDBC data source section of the Spark SQL guide): the "dbtable" option accepts anything that is valid in a FROM clause of a SQL query, so instead of a full table you can use a subquery in parentheses.

So the key is to change the "dbtable" option: make your SQL a parenthesized subquery with a table alias. The correct code is:

# 1. write your query as a parenthesized subquery with a table alias
sql = "(select * from mytable where day = '2016-11-25' and hour = 10) t1"
# 2. pass the subquery through the "dbtable" option; load() takes no argument
df = sqlContext.read \
    .format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", sql) \
    .load()

df.show()
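As a side note, newer Spark versions (2.4 and later) also expose a "query" option on the JDBC source, so the subquery-plus-alias wrapping is not needed there. A minimal sketch, assuming a SparkSession named spark and the same connection variables as above:

# Spark 2.4+ sketch: the "query" option replaces the dbtable subquery.
# ("query" and "dbtable" are mutually exclusive.)
df = spark.read \
    .format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("query", "select * from mytable where day = '2016-11-25' and hour = 10") \
    .load()

df.show()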

Answer 2 (0 votes)

sql = "(select * from mytable where day = 2016-11-25 and hour = 10) as t1"
    df = sqlContext.read
    .format("jdbc")
    .option("driver", driver)
    .option("url", url)
    .option("user", user)
    .option("password", password)
    .option("dbtable", sql)
    .load(sql)

df.show()
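Since the underlying table holds billions of rows, it may also be worth parallelizing the read once the subquery trick is in place. Here is a hedged sketch using Spark's documented JDBC partitioning options (partitionColumn, lowerBound, upperBound, numPartitions); the column name "id" and the bounds are assumptions for illustration only:

# Split the JDBC read across numPartitions tasks by ranges of "id"
# (hypothetical numeric column; the bounds only control how the range
# is split across tasks, they do not filter rows).
df = sqlContext.read \
    .format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", "(select * from mytable where day = '2016-11-25' and hour = 10) t1") \
    .option("partitionColumn", "id") \
    .option("lowerBound", "1") \
    .option("upperBound", "100000000") \
    .option("numPartitions", "10") \
    .load()

df.show()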