How to only select rows closest to a given series of values?

Question

I have the following table:

create table foobar (
    account_id  bigint    not null,
    timestamp   timestamp not null,
    balance_one numeric   not null,
    balance_two numeric   not null,
    primary key (timestamp, account_id)
);

I have a given time range with BEGIN and END. Within that range, a series of timestamps is generated with a given step size, starting at the lower bound. Let's call those timestamps "interval boundaries".

I want to select all rows within the given time range with a timestamp closer than any other to one of the generated timestamps in the series.

The general part of the query is pretty simple, select rows within the given time range:

select * from foobar
where account_id = ?
and timestamp between {BEGIN} and {END}
order by timestamp asc;

The part I don't understand, is how to trim my result set to only the rows closest to the interval boundaries.

Example data:

account_id	timestamp	balance_one	balance_two
1	28 June 2021 17:00:00	0.0	0.0
1	28 June 2021 17:00:05	1.0	0.0
1	28 June 2021 17:00:10	0.5	0.0
1	28 June 2021 17:00:15	0.0	1.0
1	28 June 2021 17:00:20	0.5	0.0
1	28 June 2021 17:00:25	1.0	1.0
2	28 June 2021 17:00:00	0.0	0.0
2	28 June 2021 17:00:05	1.0	7.0
2	28 June 2021 17:00:07	2.0	6.0
2	28 June 2021 17:00:15	3.0	5.0
2	28 June 2021 17:00:20	4.0	4.0
2	28 June 2021 17:00:25	5.0	3.0
2	28 June 2021 17:00:30	6.0	2.0
2	28 June 2021 17:00:35	7.0	1.0

Example queries with params:

ID = 1, BEGIN = 28 June 2021 17:00:00, END = 28 June 2021 17:00:30, INTERVAL = 10 seconds

Result:

account_id	timestamp	balance_one	balance_two
1	28 June 2021 17:00:00	0.0	0.0
1	28 June 2021 17:00:10	0.5	0.0
1	28 June 2021 17:00:20	0.5	0.0
1	28 June 2021 17:00:25	1.0	1.0

ID = 2, BEGIN = 28 June 2021 17:00:00, END = 28 June 2021 17:00:30, INTERVAL = 10 seconds Result:

account_id	timestamp	balance_one	balance_two
2	28 June 2021 17:00:00	0.0	0.0
2	28 June 2021 17:00:10	2.0	6.0
2	28 June 2021 17:00:20	4.0	4.0
2	28 June 2021 17:00:30	6.0	2.0

What do you mean "closest to the interval boundaries"? Sample data, desired results would help -- as would a clear explanation. — Gordon Linoff
From any starting timestamp say 0, incrementing in intervals of say 5, I want the record with timestamp closest to the interval timestamps — vddox
I took the liberty to rephrase your problem description. Please fix if I didn't get it right. — Erwin Brandstetter

Erwin Brandstetter Erwin Brandstetter · Accepted Answer · 2021-06-28T23:09:09

SELECT DISTINCT (COALESCE(p.t, n.t)).*
FROM   generate_series(timestamp '2021-06-28 17:00:00'  -- lower bound of time frame
                     , timestamp '2021-06-28 17:00:30'  -- upper bound of time frame
                     , interval '10 sec') AS g(ts)      -- step size
LEFT   JOIN LATERAL (
   SELECT t.timestamp, t
   FROM   tbl t
   WHERE  account_id = 2
   AND    t.timestamp >= g.ts
   AND    t.timestamp BETWEEN timestamp '2021-06-28 17:00:00'
                      AND     timestamp '2021-06-28 17:00:30'  -- enforce time frame
   ORDER  BY t.timestamp
   LIMIT  1
   ) n ON true
LEFT   JOIN LATERAL (
   SELECT t.timestamp, t
   FROM   tbl t
   WHERE  account_id = 2
   AND    t.timestamp < g.ts
   AND    t.timestamp BETWEEN timestamp '2021-06-28 17:00:00'
                      AND     timestamp '2021-06-28 17:00:30'  -- enforce time frame
   ORDER  BY t.timestamp DESC
   LIMIT  1
   ) p ON g.ts - p.timestamp < n.timestamp - g.ts
       OR n.timestamp IS NULL   -- suppress NULL row for no later match
WHERE  n.timestamp IS NOT NULL
    OR p.timestamp IS NOT NULL  -- suppress NULL row for no match at all
ORDER  BY timestamp;

db<>fiddle here

The series of "interval boundaries" starts at the lower bound of the time frame. It's unclear how to handle the upper bound of the time frame, if steps of your interval don't fall in line. This query ignores the upper bound in this case.

User instructions

Generate the series of "interval boundaries" with generate_series(). See:
- Generating time series between two dates in PostgreSQL
The first LATERAL subquery fetches the row with the next greater (or equal) timestamp for each interval boundary. It's undefined in the question how to handle duplicate timestamps. This query chooses an arbitrary winner. I would make that deterministic instead, by adding more ORDER BY expressions. See:
- What is the difference between LATERAL JOIN and a subquery in PostgreSQL?
The second LATERAL subquery fetches the row with the next smaller timestamp - but only if that's closer to the interval boundary than the row from the first subquery. (The later row wins ties.)
Use COALESCE in the outer SELECT to get the winning row per interval boundary.
Add DISTINCT (or some other way to remove duplicates), since the same row can be picked multiple times. (Like when there it's closest to more than one interval boundary.
Optional outer ORDER BY.

Since you want to get the whole row, I make use of the implicitly defined row type of the table. Saves another read from the table.

If your table is big, you need an index on (account_id, timestamp) for this to be fast.

Alternative with CTE

If you don't have an index, or if your table isn't big, or if there are only few rows in the given time frame, this alternative with a CTE will be faster:

WITH cte AS (
   SELECT * FROM tbl
   WHERE  account_id = 2
   AND    timestamp BETWEEN timestamp '2021-06-28 17:00:00'  -- lower bound of time frame
                    AND     timestamp '2021-06-28 17:00:30'  -- upper bound of time frame
   )
SELECT DISTINCT (COALESCE(p.t, n.t)).*
FROM   generate_series(timestamp '2021-06-28 17:00:00'  -- lower bound
                     , timestamp '2021-06-28 17:00:30'  -- upper bound
                     , interval  '10 sec') AS g(ts)     -- step size
LEFT   JOIN LATERAL (
   SELECT t.timestamp, t::tbl
   FROM   cte t
   WHERE  t.timestamp >= g.ts
   ORDER  BY t.timestamp
   LIMIT  1
   ) n ON true
LEFT   JOIN LATERAL (
   SELECT t.timestamp, t::tbl
   FROM   cte t
   WHERE  t.timestamp < g.ts
   ORDER  BY t.timestamp DESC
   LIMIT  1
   ) p ON g.ts - p.timestamp < n.timestamp - g.ts
       OR n.timestamp IS NULL   -- suppress NULL row for no later match
WHERE  n.timestamp IS NOT NULL
    OR p.timestamp IS NOT NULL  -- suppress NULL row for no match at all
ORDER  BY timestamp;

db<>fiddle here

Much like the above. But filter for the outer time frame first in a CTE. We need the explicit cast to the underlying table type, because the derived table is not registered.

How to only select rows closest to a given series of values?

1 Answers

User instructions

Alternative with CTE