
The thing is, I have read rights to a table that is partitioned by year, month and day, but I don't have the right to read the data from 2016/04/24. When I execute this in the Hive command line:

hive> select * from table where year="2016" and month="06" and day="01";

it works: I can read any other day's data, just not 2016/04/24.

But when I run the equivalent query in Spark:

sqlContext.sql("select * from table where year='2016' and month='06' and day='01'")

an exception is thrown saying that I don't have read access to hdfs/.../2016/04/24.

Does this mean that Spark SQL loads the whole table first and only then filters? How can I avoid loading the whole table?


1 Answer


You can use JdbcRDD directly. With it you bypass the Spark SQL engine entirely, so your queries are sent straight to Hive.

To use JdbcRDD you first need to load and register the Hive JDBC driver (in case it is not registered already):

// Load the Hive JDBC driver class so DriverManager can open connections with it
val driver = "org.apache.hive.jdbc.HiveDriver"
Class.forName(driver)

Then you can create a JdbcRDD:

import java.sql.{DriverManager, ResultSet}
import org.apache.spark.rdd.JdbcRDD

val connUrl = "jdbc:hive2://..."
// The two ? placeholders are required by JdbcRDD; binding both bounds to 0
// turns the extra predicate into "0 = 0", which is always true.
val query = """select * from table where year="2016" and month="06" and day="01" and ? = ?"""
val lowerBound = 0L
val upperBound = 0L
val numOfPartitions = 1
val rdd = new JdbcRDD(
    sc,
    () => DriverManager.getConnection(connUrl),
    query,
    lowerBound,
    upperBound,
    numOfPartitions,
    (r: ResultSet) => r.getString(1) // pull out the columns you need here
)
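
Once the RDD is created you can work with it like any other RDD. A quick usage sketch, building on the snippet above:

// Print a handful of rows and the total row count as a sanity check
rdd.take(10).foreach(println)
println(s"rows: ${rdd.count()}")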

A JdbcRDD query must contain two ? placeholders; JdbcRDD binds them to the lower and upper bounds in order to partition your data. So you should write a better query than this one: here the placeholders are neutralized (they become 0 = 0), which produces a single partition, just to demonstrate how it works.
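
For example, a sketch of a properly partitioned query, assuming a numeric hour column (hypothetical; substitute any numeric column your table really has):

val partitionedQuery =
    """select * from table where year="2016" and month="06" and day="01" and hour >= ? and hour <= ?"""
val partitionedRdd = new JdbcRDD(
    sc,
    () => DriverManager.getConnection(connUrl),
    partitionedQuery,
    0L,   // lower bound, substituted into the first ? (hypothetical hour range)
    23L,  // upper bound, substituted into the second ?
    4,    // JdbcRDD splits the 0..23 range into 4 sub-ranges, one per partition
    (r: ResultSet) => r.getString(1)
)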

However, before doing this I recommend that you take a look at HiveContext, which supports HiveQL as well.
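
A minimal HiveContext sketch, assuming Spark 1.x (where your sqlContext comes from):

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)
// The filter is on partition columns, so with partition pruning only the
// matching year/month/day directories should be scanned.
hiveContext.sql("select * from table where year='2016' and month='06' and day='01'").show()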