Why does SQL LAG (HIVE) give differences of 0 when the data repeats at a later position in the table (not right next to the row in question)?

Question

I have a HIVE table like this:

device        metric            timestamp          value
 d_1         cpu_time      2020-08-15 00:05:00       10
 d_1         cpu_time      2020-08-15 00:10:00       12
 d_1         cpu_time      2020-08-15 00:15:00       08
 d_2         cpu_time      2020-08-15 00:05:00       62
 d_2         cpu_time      2020-08-15 00:10:00       14
 d_2         cpu_time      2020-08-15 00:15:00       10
 d_3         cpu_time      2020-08-15 00:05:00       12
 d_3         cpu_time      2020-08-15 00:10:00       44
 d_3         cpu_time      2020-08-15 00:15:00       60

So for each distinct device the time window is shown for 10 seconds (05:00 to 15:00). This means when a new device is encountered in the data, the set of 3 timestamps repeats.

The actual HIVE table has about 12 million rows, thousands of devices, and a total time window per device of 26 days (instead of the 10 seconds shown in the example table). Also, the sampling interval between timestamps is 5 seconds (just like the example table above). So the pattern is the same in the actual table as the example table, just more data.

I run the following query to determine the sampling interval (expected to be 5 minutes) for each metric:

select
    metric,
    (unix_timestamp(timestamp) - unix_timestamp(lag_ts)) / 60 sampling_interval_minutes,
    count(*) no_hits
from (
    select 
        t.*, 
        lag(timestamp) over(partition by metric order by timestamp) lag_ts
    from my_table t
) t
group by metric, (unix_timestamp(timestamp) - unix_timestamp(lag_ts)) / 60 
order by metric, no_hits desc

...which provides output like this for the real HIVE table:

metric      sampling_interval_minutes     no_hits
cpu_time              0.0                 11976480
cpu_time              5.0                  7486
cpu_time           1445.0                   1
cpu_time             NULL                   1

The 2nd row shows the expected output, since the time window in the actual HIVE table is 26 days, which is 7488 5-minute observations (7486 above but ignore that discrepancy).

The surprising result is obviously the 1st row, which shows 11976480 hits with 0 lag. This is almost all the rows in the HIVE table. I'm assuming this means that since the time window (26 days) repeats that those differences are considered 0 between timestamps. But I would have expected lag to not care about repeats, but rather just give the differences between rows as they are encountered in the data. In other words, I would have expected there to be about 12 million lags with 5 minute intervals. This is because there are about 1600 different devices in the actual table, and they all have 26 day time windows (7488 different 5 minute intervals), and 1600 x 7488 is about 12 million.

Does the lag in SQL use permutations or combinations instead of just assessing the differences in timestamps on a row-by-row basis?

Gordon Linoff Gordon Linoff · Accepted Answer · 2020-09-12T12:21:19

I think you want the calculation per device, so you should use:

    lag(timestamp) over (partition by metric, device order by timestamp) as lag_ts

As for this statement:

But I would have expected lag to not care about repeats, but rather just give the differences between rows as they are encountered in the data.

Your expectations are just not appropriate for a SQL database. SQL tables represent unordered sets. Everything that lag() knows about ordering and partitioning is in the lag() expression itself.

Why does SQL LAG (HIVE) give differences of 0 when the data repeats at a later position in the table (not right next to the row in question)?

1 Answers