1 vote

I have the below 3 Hive tables, all with the same structure.

```
drop table default.test1;
CREATE EXTERNAL TABLE default.test1(
  c1 string,
  c2 string,
  c3 string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://s3_bucket/dev/dev/testspark/test1/';

drop table default.test2;
CREATE EXTERNAL TABLE default.test2(
  c1 string,
  c2 string,
  c3 string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://s3_bucket/dev/dev/testspark/test2/';

drop table default.test3;
CREATE EXTERNAL TABLE default.test3(
  c1 string,
  c2 string,
  c3 string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://s3_bucket/dev/dev/testspark/test3/';

hive> insert into default.test1 values("a","b","c");
hive> insert into default.test2 values("d","e","f");
hive> insert overwrite table default.test3
      select * from default.test1
      UNION ALL
      select * from default.test2;
```

After I loaded the data into test3 with the UNION ALL of test1 and test2, the S3 path of the test3 table holds the data in subfolders like below:

PRE 1/
PRE 2/

When I query the test3 table from Hive, it returns the data that was inserted. But when I query the same table in Spark, I get a zero count. In the pyspark shell:

sqlContext.sql("select * from default.test3").count()
0

How can I fix this issue?

The Spark version is 2.0. - Ramesh
Just got a thought: if I can make Hive write the output to one single file rather than 2 subdirectories, I should be able to read the output table through Spark. - Ramesh
I set hive.exec.reducers.max=1 in Hive and executed the insert overwrite statement. I still see the subfolders being created, so the issue is not resolved yet. - Ramesh
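As a sketch of the "single directory" idea from the comments (not something from the original post): the per-branch subfolders only appear because of the UNION ALL inside a single insert, so loading test3 in two separate statements should leave the files directly under the table location.

```
-- Workaround sketch: replace the single UNION ALL insert with two inserts,
-- so Hive writes the files under the table directory itself instead of the
-- per-branch subdirectories 1/ and 2/.
hive> insert overwrite table default.test3 select * from default.test1;
hive> insert into table default.test3 select * from default.test2;
```

The trade-off is two jobs instead of one, but the resulting layout should be readable by Spark without any extra settings.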

2 Answers

0 votes

Try setting the below properties before running sqlContext.sql:

sqlContext.setConf("mapred.input.dir.recursive","true"); sqlContext.setConf("mapreduce.input.fileinputformat.input.dir.recursive","true");

0 votes

There is one more property that needs to be set, along with the ones above, to make this work.

spark.conf.set("mapred.input.dir.recursive","true") 
spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive","true")
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")