I have a HIVE table like this:
device metric timestamp value
d_1 cpu_time 2020-08-15 00:05:00 10
d_1 cpu_time 2020-08-15 00:10:00 12
d_1 cpu_time 2020-08-15 00:15:00 08
d_2 cpu_time 2020-08-15 00:05:00 62
d_2 cpu_time 2020-08-15 00:10:00 14
d_2 cpu_time 2020-08-15 00:15:00 10
d_3 cpu_time 2020-08-15 00:05:00 12
d_3 cpu_time 2020-08-15 00:10:00 44
d_3 cpu_time 2020-08-15 00:15:00 60
So for each distinct device the time window is shown for 10 seconds (05:00 to 15:00). This means when a new device is encountered in the data, the set of 3 timestamps repeats.
The actual HIVE table has about 12 million rows, thousands of devices, and a total time window per device of 26 days (instead of the 10 seconds shown in the example table). Also, the sampling interval between timestamps is 5 seconds (just like the example table above). So the pattern is the same in the actual table as the example table, just more data.
I run the following query to determine the sampling interval (expected to be 5 minutes) for each metric:
select
metric,
(unix_timestamp(timestamp) - unix_timestamp(lag_ts)) / 60 sampling_interval_minutes,
count(*) no_hits
from (
select
t.*,
lag(timestamp) over(partition by metric order by timestamp) lag_ts
from my_table t
) t
group by metric, (unix_timestamp(timestamp) - unix_timestamp(lag_ts)) / 60
order by metric, no_hits desc
...which provides output like this for the real HIVE table:
metric sampling_interval_minutes no_hits
cpu_time 0.0 11976480
cpu_time 5.0 7486
cpu_time 1445.0 1
cpu_time NULL 1
The 2nd row shows the expected output, since the time window in the actual HIVE table is 26 days, which is 7488 5-minute observations (7486 above but ignore that discrepancy).
The surprising result is obviously the 1st row, which shows 11976480 hits with 0 lag. This is almost all the rows in the HIVE table. I'm assuming this means that since the time window (26 days) repeats that those differences are considered 0 between timestamps. But I would have expected lag to not care about repeats, but rather just give the differences between rows as they are encountered in the data. In other words, I would have expected there to be about 12 million lags with 5 minute intervals. This is because there are about 1600 different devices in the actual table, and they all have 26 day time windows (7488 different 5 minute intervals), and 1600 x 7488 is about 12 million.
Does the lag in SQL use permutations or combinations instead of just assessing the differences in timestamps on a row-by-row basis?