I have two hive tables T1 and T2. T1 is an external table partitioned by column date1,hour1. It also has another date column called date2 (different from date1).
T2 is a hive table partitioned by date2.
I will get data incrementally every hour, and I can easily add it to table T1 with dynamic partition.
I want an efficient way to select data from T1 and load data incrementally into T2, partitioned by date2.
This is what I am doing now
insert into T2
select
*,
date2
from
(
select * from T1 where date1="a constant date" and hour1 = "a constant hour"
) T1SubQuery
left outer join
T2
on
T1SubQuery.idColumn = T2.idColumn
where
T2.idColumn is null
I am doing a left outer join and "where T1.idColumn is null" to simulate "where not in". And I am doing that to avoid duplicate data, the query can run multiple times and I want it to be idempotent.
Questions
Which partitions from T2 will be used in this query? How can I minimize the number of partitions used?
What is the most efficient way to do this kind of idempotent incremental data load?