1
votes

I am working with Hive's dynamic partitions, and I am having an issue with a column not being populated with data unless a new partition is added. I have created a small demo example to demonstrate.

1. create table vegetables (name string, count bigint) partitioned by (year int, month int, day int);
2. create external  table dataForVeg (name string, count bigint, weight string, year int, month int, day int) row format delimited fields terminated by ' ';
3. load data1 into dataforveg 
4. set hive.exec.dynamic.partition.mode=nonstrict;
5. insert into table vegetables  partition(year, month, day) select name, count, year, month, day from dataforveg;

6. hive> select * from vegetables where day='5';
tomato  5       2013    11      5
cabbage 3       2013    11      5

7. hive> alter table  vegetables add columns(weight double);
8. hive> describe vegetables ;
name    string
count   bigint
weight  double
year    int
month   int
day     int

9. hive> select * from vegetables where day='5';
tomato  5       NULL    2013    11      5
cabbage 3       NULL    2013    11      5

hive> select * from vegetables where day='4';
potato  2       NULL    2013    11      4

10. load overwrite data2 into dataforveg 
11. hive> select * from dataforveg;
carrot  10      5       2013    11      5
pepper  15      2       2013    11      5

12. hive> select *  from vegetables where day='5';
tomato  5       NULL    2013    11      5
cabbage 3       NULL    2013    11      5
carrot  10      NULL    2013    11      5
pepper  15      NULL    2013    11      5

13. load overwrite data3 into dataforveg 
hive> select * from dataforveg;
beet      4       1       2013    11      6
broccoli  3       1       2013    11      6

14. hive> select *  from vegetables;
potato  2       NULL    2013    11      4
tomato  5       NULL    2013    11      5
cabbage 3       NULL    2013    11      5
carrot  10      NULL    2013    11      5
pepper  15      NULL    2013    11      5
beet    4       1.0     2013    11      6
broccoli 3       1.0     2013    11      6

As you can see from the example, data is being updated when you add a new partition. Question: is there a way to refresh the value of a new field 'weight' for carrot and pepper in step 12? In other words, is there a way to fill the newly added columns with data in an already existing partition?

1

1 Answers

0
votes

The underlying file system, HDFS, does not support updating or even appending files. Your only option in this situation is to create a MapReduce job that would merge the data from the old partition with the values for the new column and then replace the files from that partition with the output of that MapReduce job.

If you are not into writing MapReduce jobs, you can probably rig something combining Hive CTAS (create table as select) and HDFS operations.