We have spark job but also randomly run hive query in current hadoop cluster
I have seen the same hive table has different partition pattern like below:
i.e. if the table is partition by date, so
hdfs dfs -ls /data/hive/warehouse/db_name/table_name/part_date=2019-12-01/
gave result
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-00001
....
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06669
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06670
however if find data from different partition date
hdfs dfs -ls /data/hive/warehouse/db_name/table_name/part_date=2020-01-01/
list files with different name patter
/data/hive/warehouse/db_name/table_name/part_date=2020-01-01/000007_0
/data/hive/warehouse/db_name/table_name/part_date=2020-01-01/000008_0
....
/data/hive/warehouse/db_name/table_name/part_date=2020-01-01/000010_0
What I can tell the difference not only in one partition the data files come with part-
prefix and the other is like 00000n_0
, also there are a lot more amount of files for part-
file but each file is quite small.
I also found aggregation on part-
files are a lot slower than 00000n_0
files
what could be the possible cause of the file pattern difference and what could be the configuration to change from one to another?