PySpark - read recursive Hive table

Question

I have a Hive table that has multiple sub-directories in HDFS, something like:

/hdfs_dir/my_table_dir/my_table_sub_dir1
/hdfs_dir/my_table_dir/my_table_sub_dir2
...

Normally I set the following parameters before I run a Hive script:

set hive.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
set hive.supports.subdirectories=true;
set mapred.input.dir.recursive=true;

select * from my_db.my_table;

I'm trying to do the same in PySpark:

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext

conf = (SparkConf().setAppName("My App")
        ...
        .set("hive.input.dir.recursive", "true")
        .set("hive.mapred.supports.subdirectories", "true")
        .set("hive.supports.subdirectories", "true")
        .set("mapred.input.dir.recursive", "true"))

sc = SparkContext(conf = conf)

sqlContext = HiveContext(sc)

my_table = sqlContext.sql("select * from my_db.my_table")

but I end up with an error like:

java.io.IOException: Not a file: hdfs://hdfs_dir/my_table_dir/my_table_sub_dir1

What's the correct way to read a Hive table with sub-directories in Spark?

user3124185 · Accepted Answer · 2016-07-06T20:29:19

Try setting them through sqlContext.sql() before executing the query:

sqlContext.sql("SET hive.mapred.supports.subdirectories=true")
sqlContext.sql("SET mapreduce.input.fileinputformat.input.dir.recursive=true")
my_table = sqlContext.sql("select * from my_db.my_table")
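
If you need this in more than one place, the SET statements above can be wrapped in a small helper. This is only a sketch: read_table_recursively and RECURSIVE_READ_SETTINGS are names made up for illustration, not part of PySpark, and the helper assumes it is handed an object with a sql() method, such as the HiveContext from the question.

```python
# Hypothetical helper (not a PySpark API): applies the two settings from
# the answer above, then runs the query against the given table.
RECURSIVE_READ_SETTINGS = [
    ("hive.mapred.supports.subdirectories", "true"),
    ("mapreduce.input.fileinputformat.input.dir.recursive", "true"),
]


def read_table_recursively(sql_context, table_name,
                           settings=RECURSIVE_READ_SETTINGS):
    """Issue each SET statement through sql_context, then select the table.

    sql_context is assumed to expose a sql(statement) method,
    e.g. the HiveContext created in the question.
    """
    for key, value in settings:
        sql_context.sql("SET {}={}".format(key, value))
    return sql_context.sql("select * from {}".format(table_name))


# Usage, assuming sqlContext is the HiveContext from the question:
# my_table = read_table_recursively(sqlContext, "my_db.my_table")
```

The settings are passed in as a list so you can add or drop properties for your cluster without touching the function body.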
