
We run Spark jobs on our current Hadoop cluster, but we also occasionally run Hive queries against it.

I have noticed that the same Hive table has different file name patterns in different partitions, as shown below.

For example, the table is partitioned by date, so

hdfs dfs -ls /data/hive/warehouse/db_name/table_name/part_date=2019-12-01/

gives this result:

/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-00001
....
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06669
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06670

However, listing a different partition date

hdfs dfs -ls /data/hive/warehouse/db_name/table_name/part_date=2020-01-01/

lists files with a different name pattern:

/data/hive/warehouse/db_name/table_name/part_date=2020-01-01/000007_0
/data/hive/warehouse/db_name/table_name/part_date=2020-01-01/000008_0
....
/data/hive/warehouse/db_name/table_name/part_date=2020-01-01/000010_0

From what I can tell, the difference is not only that the data files in one partition have the part- prefix while the other partition's files look like 00000n_0; there are also far more part- files, and each of them is quite small.

I also found that aggregations over the part- files are a lot slower than over the 00000n_0 files.

What could be the cause of the file pattern difference, and what configuration would change one into the other?
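
For reference, a quick way to compare the two partitions from the Hive side (file count and total size, read from the partition statistics; this is a generic check, nothing specific to this table):

ANALYZE TABLE db_name.table_name PARTITION (part_date='2019-12-01') COMPUTE STATISTICS;
ANALYZE TABLE db_name.table_name PARTITION (part_date='2020-01-01') COMPUTE STATISTICS;

-- numFiles and totalSize appear under "Partition Parameters" in the output
DESCRIBE FORMATTED db_name.table_name PARTITION (part_date='2019-12-01');
DESCRIBE FORMATTED db_name.table_name PARTITION (part_date='2020-01-01');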


1 Answer


When Spark Streaming writes data into Hive, it creates lots of small files named part-*, and their number keeps growing. This causes performance problems when querying the Hive table: Hive takes too long to return results because of the large number of small files in the partition.

When a Spark job writes data into Hive, the partition looks like this:

/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-00001
....
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06669
/data/hive/warehouse/db_name/table_name/part_date=2019-12-01/part-06670

The different file pattern here is the result of compaction logic that was run on the partition's files to merge the small files into larger ones. The n in 00000n_0 is the reducer number.

Here is a sample compaction script, which compacts the small files into bigger files within each partition, for an example table under a sample database:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.reducers.bytes.per.reducer=268435456; --256MB reducer size.

CREATE TABLE sample.example_tmp
     STORED AS parquet
        LOCATION '/user/hive/warehouse/sample.db/example_tmp'
AS
  SELECT * FROM sample.example;

INSERT OVERWRITE table sample.example PARTITION (part_date) select * from sample.example_tmp;

DROP TABLE IF EXISTS sample.example_tmp PURGE;

The above script compacts the small files into a few bigger files within each partition, and the resulting file names follow the 00000n_0 pattern.
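
As a side note (these are standard Hive merge settings, not something from the original script), Hive can also be configured to merge small output files automatically at the end of a job, so the compaction happens as part of the write rather than as a separate script:

set hive.merge.mapfiles=true;                 -- merge small files produced by map-only jobs
set hive.merge.mapredfiles=true;              -- merge small files produced by map-reduce jobs
set hive.merge.tezfiles=true;                 -- same, when running on the Tez engine
set hive.merge.smallfiles.avgsize=134217728;  -- trigger a merge when the average file size is below 128MB
set hive.merge.size.per.task=268435456;       -- aim for roughly 256MB per merged file

Note that these settings only apply to jobs executed by Hive itself; files written directly by a Spark job are not merged by them.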

What could be the cause of the file pattern difference, and what configuration would change one into the other?

Someone might have run compaction logic on the partition using Hive, or might have reloaded the partition data using Hive. This is not an issue; the data remains the same.
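
For example, if only one partition needs compacting, a minimal sketch of reloading just that partition in place (using the table and partition names from the question, and relying on the usual behaviour that Hive stages the query result before overwriting the target) would be:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.reducers.bytes.per.reducer=268435456; -- target roughly 256MB per output file

-- Rewrites only the 2019-12-01 partition; with dynamic partitioning,
-- INSERT OVERWRITE replaces only the partitions present in the SELECT output.
INSERT OVERWRITE TABLE db_name.table_name PARTITION (part_date)
SELECT * FROM db_name.table_name
WHERE part_date = '2019-12-01';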