So I have a really huge table containing billions of rows. I tried to load data with the Spark DataFrame API; here is my code:

sql = "select * from mytable where day = 2016-11-25 and hour = 10"
df = sqlContext.read \
    .format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", table) \
    .load(sql)

df.show()

I ran the SQL directly in MySQL and it returns about 100 rows, but the code above does not work in Spark SQL: it hits an OOM error. It seems that Spark SQL loads all the data into memory without applying the WHERE clause. How can I get Spark SQL to use the WHERE clause?

Comment from Shiv4nsh (2 votes): try running this:

df = sqlContext.read \
    .format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", table) \
    .load("select * from mytable") \
    .where($"day" === "2016-11-25" and $"hour" === 10)
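Note that the comment mixes Scala Column syntax ($"day" === ...) into otherwise Python code, and .load() does not take a SQL string. A hedged PySpark rendering of the same idea, assuming the sqlContext and connection variables from the question, might look like this (whether the filter is actually pushed down to MySQL depends on the JDBC source's predicate pushdown; the subquery approach in the first answer below is the reliable fix):

from pyspark.sql.functions import col

# Sketch only: load the table via "dbtable", then filter with Column
# expressions. Note PySpark uses `&`, not `and`, to combine conditions.
df = sqlContext.read \
    .format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", table) \
    .load() \
    .where((col("day") == "2016-11-25") & (col("hour") == 10))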

2 Answers

Answer 1 (3 votes)

I have solved the problem. The Spark doc gives the answer (see the JDBC data source section of the Spark SQL guide): the "dbtable" option accepts anything that is valid in a FROM clause of a SQL query, so instead of a full table you can use a subquery in parentheses.

So the key is to change the "dbtable" option: make your SQL a parenthesized subquery with a table alias. The correct code is:

# 1. write your query as a parenthesized subquery with a table alias
sql = "(select * from mytable where day = '2016-11-25' and hour = 10) t1"
# 2. pass the subquery through the "dbtable" option; load() takes no argument
df = sqlContext.read \
    .format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", sql) \
    .load()

df.show()
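As a side note, newer Spark versions (2.4 and later) also expose a "query" option on the JDBC source, so the subquery-plus-alias wrapping is not needed there. A minimal sketch, assuming a SparkSession named spark and the same connection variables as above:

# Spark 2.4+ sketch: the "query" option replaces the dbtable subquery.
# ("query" and "dbtable" are mutually exclusive.)
df = spark.read \
    .format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("query", "select * from mytable where day = '2016-11-25' and hour = 10") \
    .load()

df.show()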

Answer 2 (0 votes)

sql = "(select * from mytable where day = 2016-11-25 and hour = 10) as t1"
    df = sqlContext.read
    .format("jdbc")
    .option("driver", driver)
    .option("url", url)
    .option("user", user)
    .option("password", password)
    .option("dbtable", sql)
    .load(sql)

df.show()
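Since the underlying table holds billions of rows, it may also be worth parallelizing the read once the subquery trick is in place. Here is a hedged sketch using Spark's documented JDBC partitioning options (partitionColumn, lowerBound, upperBound, numPartitions); the column name "id" and the bounds are assumptions for illustration only:

# Split the JDBC read across numPartitions tasks by ranges of "id"
# (hypothetical numeric column; the bounds only control how the range
# is split across tasks, they do not filter rows).
df = sqlContext.read \
    .format("jdbc") \
    .option("driver", driver) \
    .option("url", url) \
    .option("user", user) \
    .option("password", password) \
    .option("dbtable", "(select * from mytable where day = '2016-11-25' and hour = 10) t1") \
    .option("partitionColumn", "id") \
    .option("lowerBound", "1") \
    .option("upperBound", "100000000") \
    .option("numPartitions", "10") \
    .load()

df.show()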