3
votes

I am trying to understand how exactly ALTER TABLE ... CONCATENATE in Hive works.

I saw this link How does Hive 'alter table <table name> concatenate' work? but all I got from it is that for ORC files, the merge happens at the stripe level.

I am looking for a detailed explanation of how CONCATENATE works. For example, I initially had 500 small ORC files in HDFS. I ran Hive's ALTER TABLE ... CONCATENATE and the files merged into 27 bigger files. Subsequent runs of CONCATENATE reduced the number of files to 16, and finally I ended up with two large files (using Hive 0.12; a sketch of the statements I ran is at the end of this question). So I wanted to understand:

  1. How exactly does CONCATENATE work? Does it look at the existing number of files as well as their size? How does it determine the number of output ORC files after concatenation?

  2. Are there any known issues with using CONCATENATE? We are planning to run it once a day in the maintenance window.

  3. Is using CTAS an alternative to CONCATENATE, and which is better? Note that my requirement is to reduce the number of ORC files (ingested through NiFi) without compromising read performance.

Any help is appreciated, and thanks in advance.
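For reference, here is a minimal sketch of the kind of statements I ran (the database, table, and partition names are hypothetical):

-- Merge the small ORC files in one partition; each repeated run may reduce the file count further.
ALTER TABLE mydb.mytable PARTITION (dt='2019-01-01') CONCATENATE;

-- For an unpartitioned table, the statement applies to the whole table.
ALTER TABLE mydb.mytable CONCATENATE;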

2
Thank you Brian. That helps - Nina A
Hi, I was doing the same thing, running this once a day, but I was concerned about the atomicity of this operation, say if I do a read on the same partition. - Arushi

2 Answers

1
vote

The concatenated file size can be controlled with the following two values:

set mapreduce.input.fileinputformat.split.minsize=268435456;
set hive.exec.orc.default.block.size=268435456;

These values should be set based on your HDFS/MapR-FS block size.
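For example, a daily maintenance-window session might look like this (the table and partition names are hypothetical, and the 256 MB target should be tuned to your block size):

-- Target roughly 256 MB (268435456 bytes) output files.
set mapreduce.input.fileinputformat.split.minsize=268435456;
set hive.exec.orc.default.block.size=268435456;

-- Then merge the small ORC files at stripe level.
ALTER TABLE mydb.mytable PARTITION (dt='2019-01-01') CONCATENATE;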

0
votes

As commented by @leftjoin, it is indeed the case that you can get different output files for the same underlying data.

This is discussed in more detail in the linked HCC thread, but the key point is:

Concatenation depends on which files are chosen first.

Note that having files of different sizes should not be a problem in normal situations.

If you want to streamline your process then, depending on how big your data is, you may also want to batch it a bit before writing to HDFS, for instance by setting the batch size in NiFi.
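If you run the concatenation daily, one simple way to sanity-check the result is to compare the partition's file listing before and after the merge from the Hive CLI (the warehouse path below is hypothetical; use your table's actual location):

-- List the partition's files, merge, then list again to confirm the file count dropped.
dfs -ls /apps/hive/warehouse/mydb.db/mytable/dt=2019-01-01;
ALTER TABLE mydb.mytable PARTITION (dt='2019-01-01') CONCATENATE;
dfs -ls /apps/hive/warehouse/mydb.db/mytable/dt=2019-01-01;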