Exact behavior of ADD COLUMN with CASCADE in Hive

Question

Suppose the next table:

sch_test.test_cascade
    a    b    c
    -----------
    6    10   1
    6    10   1
    6    10   2
    6    10   2

a, b and c are integers and the table is partitioned by c. When I add a column in Hive with CASCADE:

ALTER TABLE sch_test.test_cascade ADD COLUMNS (d int, e int) CASCADE;

it returns the next table:

sch_test.test_cascade
a    b    d    e     c
-------------------------
6    10   1    NULL  1    
6    10   1    NULL  1  
6    10   2    NULL  2  
6    10   2    NULL  2

That is, the first inserted column (d, in this case) takes the values of the partition column (Note that I've added two columns to show that the second one is NULL). But, if I add a column without CASCADE:

ALTER TABLE sch_test.test_cascade ADD COLUMNS (d int, e int);

it returns the next table:

sch_test.test_cascade
a    b    d       e      c
----------------------------
6    10   NULL    NULL   1
6    10   NULL    NULL   1
6    10   NULL    NULL   2
6    10   NULL    NULL   2

That is, both inserted columns are NULL.

I don't understand very well the difference between add a column with CASCADE or RESTRICT, in Hive documentation I can see:

The CASCADE|RESTRICT clause is available in Hive 1.1.0. ALTER TABLE ADD|REPLACE COLUMNS with CASCADE command changes the columns of a table's metadata, and cascades the same change to all the partition metadata. RESTRICT is the default, limiting column changes only to table metadata.

But it isn't clear to me the difference between "cascades the same change to all the partition metadata" and "limiting column changes only to table metadata". (Does Restrict apply only to one partition and Cascade to all them? The previous example says the opposite). And if that is the reason why the first column inserted (in CASCADE mode) takes the value of the partition column.

Without CASCADE the ALTER applies only to the default definition i.e. the one that will be used for new partitions, but not to existing partitions. The rationale is that you already have legacy data files with legacy format in legacy partitions. — Samson Scharfrichter
About your test case with the partition key being used also "magically" as the next actual column, well, ahem, I can only guess. Are you sure there wasn't an extra column in the data files, that was hidden because your Hive definition mapped only the first 2 cols?? Can you tail one of these files to be sure? — Samson Scharfrichter
Thank you for you answer. I'm pretty sure because I've written the files by hand, so I think the assumption that there are extra columns doesn't apply, it's still a mystery! Nevertheless your explanation about CASCADE and ALTER has sense, because only CASCADE changes de value of old partitions (but in a strange way). — Amanda
if you don't add cascade then you can't load(historical or backfill) those columns in the already existing partitions. — Sivaji

Pramod Reddy Baireddy Pramod Reddy Baireddy · Accepted Answer · 2020-11-18T07:00:41

If we don't add cascade during ADD column, the existing partitions would have same definition. This new column would be null in them, even after overwriting the partition with new field. You can drop and recreate the partition to fix this though.

Exact behavior of ADD COLUMN with CASCADE in Hive

1 Answers