0
votes

say I have a table t1 which is date partitioned on column sdate

I have another table t2 which is also partitioned on column pdate which has data already in some partitions (20200101 to 20200130)

Now, I want to overwrite few partitions ( 20200105 to 20200110) in t2 from data in t1

and I want to do it in a single query and not one query per partition.

Select Query:

SELECT sdate,
c1, c2

FROM `dataset.t1` where sdate between "2020-01-05" and "2020-01-10"

I have seen examples where we can specify destination table like t2$20200102, but that will overwrite only a single partition.

How can we write insert statement to populate selected partitions

Any ideas of how the insert query would look like?

1
Is there any primary key which can correlate the rows in table t1 and table t2?Alexandre Moraes
No, There isn't any key. My only purpose is to overwrite partitions in t1 with data from t2Chandan Bhattad
I asked that because to use merge, updating and deleting rows you would need a primary key to relate these two tables. Otherwise, the answer from Guillem Xercavins is the best option.Alexandre Moraes
Agree. I was looking for something which is "atomic". Since there are multiple steps involved, we might end up in inconsistent state if any of the step failsChandan Bhattad

1 Answers

1
votes

You can simply use SELECT sdate AS pdate in your INSERT statement:

INSERT dataset.t2 (pdate, c1, c2) 
  SELECT sdate AS pdate, c1, c2
  FROM dataset.t1
  WHERE sdate BETWEEN "2020-01-05" and "2020-01-10"

Of course, this will append the rows to each target partition so, if you want to overwrite them, you can delete them first with:

DELETE
  dataset.t2
WHERE
  pdate BETWEEN "2020-01-05" AND "2020-01-10"

Obviously, you should be careful when deleting partitions. An extra check can be added if you only want to overwrite dates with data in the source table:

DELETE
  dataset.t2
WHERE
  pdate IN (
    SELECT sdate 
    FROM dataset.t1 
    WHERE sdate BETWEEN "2020-01-05" AND "2020-01-10")

You can also consider using MERGE but then you would need to take into account what happens when matching row by row.

I did some tests with this gist.

Another option is to do a piece-wise SELECT query to break the data into three date ranges and UNION ALL to bring them together.

SELECT * FROM dataset.t2
WHERE pdate < "2020-01-05" OR pdate > "2020-01-10"
UNION ALL
SELECT * FROM dataset.t1
WHERE sdate BETWEEN "2020-01-05" AND "2020-01-10"

This requires specifying a destination table with WRITE_TRUNCATE and runs in a single job but has to essentially re-write the whole table.