3
votes

I am trying to understand how exactly ALTER TABLE ... CONCATENATE in Hive works.

I saw this link How does Hive 'alter table <table name> concatenate' work? but all I got from it is that for ORC files, the merge happens at the stripe level.

I am looking for a detailed explanation of how CONCATENATE works. For example, I initially had 500 small ORC files in HDFS. I ran Hive's ALTER TABLE ... CONCATENATE and the files merged into 27 bigger files. Subsequent runs of CONCATENATE reduced the number of files to 16, and finally I ended up with two large files (using Hive 0.12; a sketch of the statements I ran is at the end of this question). So I wanted to understand:

  1. How exactly does CONCATENATE work? Does it look at the existing number of files as well as their size? How does it determine the number of output ORC files after concatenation?

  2. Are there any known issues with using CONCATENATE? We are planning to run it once a day in the maintenance window.

  3. Is using CTAS an alternative to CONCATENATE, and which is better? Note that my requirement is to reduce the number of ORC files (ingested through NiFi) without compromising read performance.

Any help is appreciated, and thanks in advance.
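For reference, here is a minimal sketch of the kind of statements I ran (the database, table, and partition names are hypothetical):

-- Merge the small ORC files in one partition; each repeated run may reduce the file count further.
ALTER TABLE mydb.mytable PARTITION (dt='2019-01-01') CONCATENATE;

-- For an unpartitioned table, the statement applies to the whole table.
ALTER TABLE mydb.mytable CONCATENATE;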

2
Thank you Brian. That helps - Nina A
Hi, I was doing the same thing, running this once a day, but I was concerned about the atomicity of this operation, say if I do a read on the same partition. - Arushi

2 Answers

1
vote

The concatenated file size can be controlled with the following two values:

set mapreduce.input.fileinputformat.split.minsize=268435456;
set hive.exec.orc.default.block.size=268435456;

These values should be set based on your HDFS/MapR-FS block size.
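For example, a daily maintenance-window session might look like this (the table and partition names are hypothetical, and the 256 MB target should be tuned to your block size):

-- Target roughly 256 MB (268435456 bytes) output files.
set mapreduce.input.fileinputformat.split.minsize=268435456;
set hive.exec.orc.default.block.size=268435456;

-- Then merge the small ORC files at stripe level.
ALTER TABLE mydb.mytable PARTITION (dt='2019-01-01') CONCATENATE;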

0
votes

As commented by @leftjoin, it is indeed the case that you can get different output files for the same underlying data.

This is discussed in more detail in the linked HCC thread, but the key point is:

Concatenation depends on which files are chosen first.

Note that having files of different sizes should not be a problem in normal situations.

If you want to streamline your process then, depending on how big your data is, you may also want to batch it a bit before writing to HDFS, for instance by setting the batch size in NiFi.
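If you run the concatenation daily, one simple way to sanity-check the result is to compare the partition's file listing before and after the merge from the Hive CLI (the warehouse path below is hypothetical; use your table's actual location):

-- List the partition's files, merge, then list again to confirm the file count dropped.
dfs -ls /apps/hive/warehouse/mydb.db/mytable/dt=2019-01-01;
ALTER TABLE mydb.mytable PARTITION (dt='2019-01-01') CONCATENATE;
dfs -ls /apps/hive/warehouse/mydb.db/mytable/dt=2019-01-01;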