
I have a history table that is more than a TB in size, and I receive delta (updated) records daily, a few GB at a time, which are stored in a delta table. I want to compare the delta records with the history records and update the history table with the latest data from the delta table.

What is the best approach to do this in Hive, given that I am dealing with millions of rows? I searched the web and found the approach below:

http://hortonworks.com/blog/four-step-strategy-incremental-updates-hive

But I don't think it is the best approach in terms of performance.


1 Answer


In recent Hive (0.14+), you can do updates. You need to store the table in ORC format and bucket it by the key you search on.
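A minimal sketch of such a table, assuming Hive 0.14+ with ACID transactions enabled on the cluster; the table name, columns, and bucket count are illustrative:

```sql
-- Requires hive.support.concurrency=true and
-- hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
CREATE TABLE history (
  id BIGINT,            -- the key the updates search on
  payload STRING,
  updated_at TIMESTAMP
)
CLUSTERED BY (id) INTO 32 BUCKETS  -- must be bucketed for ACID
STORED AS ORC                      -- must be ORC for ACID
TBLPROPERTIES ('transactional' = 'true');

-- Row-level update (Hive 0.14 UPDATE takes a simple WHERE clause;
-- joining against the delta table in an UPDATE is not supported
-- until MERGE arrives in later Hive versions)
UPDATE history
SET payload = 'new value'
WHERE id = 42;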

For more information, see this link: Hive Transactions

In addition: do you have a good partitioning key, so that the updates only touch the latest partitions? If so, it can be good to do the following:

  1. Get the data from the required partitions into a temp table (T1).
  2. Let T2 be the table with the update records; it needs to be partitioned the same way as T1.
  3. Join T1 and T2 on the key(s) and keep only the rows present in T1 but not in T2; call this table T3.
  4. Union T2 and T3 to create table T4.
  5. Drop the previously taken partitions from T1.
  6. Insert T4 into T1.
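The steps above can be sketched in HiveQL roughly as follows. All table, column, and partition names (`history`, `id`, `dt`, the partition values) are hypothetical, and steps 5-6 are read here as rebuilding the affected partitions of the history table:

```sql
-- 1. Copy only the affected partitions into a temp table T1
CREATE TABLE t1 AS
SELECT * FROM history
WHERE dt IN ('2015-06-01', '2015-06-02');

-- 2. t2 is assumed to already hold the delta records,
--    laid out with the same schema as t1.

-- 3. Rows in T1 that were NOT superseded by the delta
CREATE TABLE t3 AS
SELECT t1.*
FROM t1
LEFT OUTER JOIN t2 ON t1.id = t2.id
WHERE t2.id IS NULL;

-- 4. Updated rows plus untouched rows
CREATE TABLE t4 AS
SELECT * FROM t2
UNION ALL
SELECT * FROM t3;

-- 5. Drop the stale partitions from the history table
ALTER TABLE history DROP IF EXISTS
  PARTITION (dt='2015-06-01'),
  PARTITION (dt='2015-06-02');

-- 6. Reinsert the rebuilt data (dynamic partitioning; needs
--    hive.exec.dynamic.partition.mode=nonstrict, and dt must be
--    the last column of the SELECT)
INSERT INTO TABLE history PARTITION (dt)
SELECT * FROM t4;
```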

Remember, these operations are not atomic: between steps 5 and 6, any query running against T1 can see intermediate results.