
There is an external table in Hive pointing to an S3 location that is not partitioned. The table points to a folder in S3, but the data sits in multiple subfolders inside that folder.
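For concreteness, here is a minimal Scala sketch of the kind of setup being described; the bucket, folder layout, table name, and columns (s3://my-bucket/events/, my_table, id, payload) are hypothetical stand-ins, not taken from the question.

```scala
// Hypothetical layout: the table's LOCATION is the parent folder, but the
// data files live one level down, in subfolders.
//
//   s3://my-bucket/events/            <- LOCATION (no partitions declared)
//   s3://my-bucket/events/2016-01-01/part-00000
//   s3://my-bucket/events/2016-01-02/part-00000

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

val sc = new SparkContext(new SparkConf().setAppName("subfolder-demo"))
val sqlContext = new HiveContext(sc)

// The same DDL could equally be run from the Hive shell; all names are made up.
sqlContext.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS my_table (id INT, payload STRING)
    |ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    |LOCATION 's3://my-bucket/events/'""".stripMargin)
```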

This table can still be queried in Hive, even though it is not partitioned, by setting a few properties like the ones below:

set hive.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
set hive.supports.subdirectories=true;
set mapred.input.dir.recursive=true;

However, when the same table is used in Spark to load the data into a DataFrame with a SQL statement such as df = sqlContext.sql("select * from table_name"), the action fails with an error along the lines of 'The subfolders in the external s3 location is not a file'.

I tried setting the above Hive properties in Spark with sc.hadoopConfiguration.set("mapred.input.dir.recursive", "true"), but it did not help. It looks like this only applies to RDD-style loading such as sc.textFile.
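For comparison, the Hadoop-level setting does take effect for RDD-style reads. A minimal sketch, reusing the hypothetical s3://my-bucket/events/ location from above:

```scala
// Recursive listing at the FileInputFormat level: this is honored by
// sc.textFile, but not by the Hive table scan behind sqlContext.sql.
sc.hadoopConfiguration.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

val lines = sc.textFile("s3://my-bucket/events/") // now descends into subfolders
println(lines.count())
```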


1 Answer


This can be achieved by setting the following property in Spark:

sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")

Note that the property is set using sqlContext instead of sparkContext. I tested this on Spark 1.6.2.
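Putting it together, a minimal sketch against the table_name from the question; the df.show() call is only there to force a read:

```scala
// Enable recursive directory listing for the table scan, then query as before.
sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive", "true")

val df = sqlContext.sql("select * from table_name")
df.show() // succeeds once the subfolders are traversed
```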