1
votes

I have a table partitioned on year,month,day and hour. If I use the following INSERT OVERWRITE to a specific partition it places a file under appropriate directory structure. This file contains the string abc:-

INSERT OVERWRITE TABLE testtable PARTITION(year = 2017, month = 7, day=29, hour=18)
SELECT tbl.c1 FROM  
(
select 'abc' as c1
) as tbl;

But if I use the following statement, Hive surprisingly creates three new folders under the folder "hour=18".

enter image description here

And there is a file inside each of these three subfolders.

INSERT OVERWRITE TABLE testtable PARTITION(year = 2017, month = 7, day=29, hour=18)
SELECT tbl.c1 FROM  
(
select 'abc' as c1
union ALL
select 'xyz' as c1
union ALL
select 'mno' as c1
) as tbl; 

When I query the data, it shows the data as expected. But why did it create these 3 new folders? Since the partitioning scheme is only for year,month,day and hour I wouldn't expect Hive to create folders for anything other than these.

3

3 Answers

3
votes

Actually it has nothing to do with INSERT OVERWRITE or partitioning.

It's UNION ALL statement that adds additional directories. Why it bothers you?

You can do some DISTRIBUTE BY shenanigans or set number of reducers to 1 to put this into one file.

1
votes

Hi guys I had the same issue and thought of sharing.

Union all adds extra subfolder in the table.

The count(*) on the table will give 0 records and the msck repair will error out with the default properties. After using set hive.msck.path.validator=ignore; MSCK will not error out but will message "Partitions not in metastore"

Only after setting the properties as mentioned above by DogBoneBlues (SET hive.mapred.supports.subdirectories=TRUE; SET mapred.input.dir.recursive=TRUE;) The table is returning values.(count(*))

-1
votes

You can use just "union" instead of "union all" if you dont care about duplicates. "union" should not create sub-folders.