5
votes

I have a requirement to insert streaming of records into Hive partitioned table. The table structure is something like

CREATE TABLE store_transation (
     item_name string,
     item_count int,
     bill_number int,
) PARTITIONED BY (
   yyyy_mm_dd string
);

I would like to understand how Hive handles inserts in the internal table.

Does all record insert into a single file inside the yyyy_mm_dd=2018_08_31 directory? Or hive splits into multiple files inside a partition, if so when?

Which one performs well from the following in case of per day 1 million records and the querying pattern is going to be between date range?

  1. No partition in the internal table
  2. Partition by date, each date has only one file
  3. Partition by date, each date has more than one file
1

1 Answers

4
votes

Insert will perform the same in all your cases because insert does not not query your existing data (unless you are inserting using select from itself) and normally each container will create it's own file unless file merge is configured.

Querying the data using date range will perform better if the table is partitioned by date. Too many files may cause performance degradation, so you may want to merge files during insert. What is too many files? Like hundreds or even thousands per daily partition. Having few files per partition will not cause performance problems, you do not need to merge them.

Insert statement will create additional file(s) in the partitions directory and normally will not merge with existing files. How many files will be created depends on the insert statement and configuration settings.

The number of final mappers or reducers running + configuration settings will determine the number of output files. You can force it running on single reducer by, for example, adding 'order by' clause. In such case one additional file per partition will be created but it will work slow. Also distribute by partition key can be used to reduce the number of files created, but this will trigger additional reducer stage and will work slower than map-only task.

Also you can merge new files with existing using these settings:

SET hive.merge.mapfiles=true;
SET hive.merge.mapredfiles=true;
SET hive.merge.size.per.task=128000000; -- (128MB)
SET hive.merge.smallfiles.avgsize=128000000; -- (128MB)

These configuration settings may trigger merge task (depending on sizes configured in above settings) at the end and it will merge existing files as well as newly added by insert.

See this answer for more details about merge: https://stackoverflow.com/a/45266244/2700344

Actually the type of table, managed or external, does not matter in this context. Insert or select will work the same.

If you already have files in the same format as target table, then the fastest way is to put them in partitions directories without using DML query at all.

For ORC files you can merge files efficiently using this command: ALTER TABLE T [PARTITION partition_spec] CONCATENATE;