Proper way to access latest row for each individual identifier?

Question

I have a table core_message in Postgres, with millions of rows that looks like this (simplified):

┌────────────────┬──────────────────────────┬─────────────────┬───────────┬──────────────────────────────────────────┐
│    Colonne     │           Type           │ Collationnement │ NULL-able │                Par défaut                │
├────────────────┼──────────────────────────┼─────────────────┼───────────┼──────────────────────────────────────────┤
│ id             │ integer                  │                 │ not null  │ nextval('core_message_id_seq'::regclass) │
│ mmsi           │ integer                  │                 │ not null  │                                          │
│ time           │ timestamp with time zone │                 │ not null  │                                          │
│ point          │ geography(Point,4326)    │                 │           │                                          │
└────────────────┴──────────────────────────┴─────────────────┴───────────┴──────────────────────────────────────────┘
Index:
    "core_message_pkey" PRIMARY KEY, btree (id)
    "core_message_uniq_mmsi_time" UNIQUE CONSTRAINT, btree (mmsi, "time")
    "core_messag_mmsi_b36d69_idx" btree (mmsi, "time" DESC)
    "core_message_point_id" gist (point)

The mmsi column is a unique identifier used to identify ships in the world. I'm trying to get the latest row for each mmsi.

I can get that like this, for example:

SELECT a.* FROM core_message a
JOIN  (SELECT mmsi, max(time) AS time FROM core_message GROUP BY mmsi) b
       ON a.mmsi=b.mmsi and a.time=b.time;

But this is too slow, 2 seconds+.

So my solution was to create a distinct table containing only the latest rows (100K+ rows max) of the core_message table, called LatestMessage.

This table is populated via my application every time new rows have to be added to core_message.

It worked fine, I'm able to access the table in a matter of milliseconds. But I'd be curious to know if there is a better way to achieve that using only one table and keep the same level of performance for data access.

Possible duplicate of Select first row in each GROUP BY group? — Clockwork-Muse
@Clockwork-Muse While this answer in itself does not solve my case, one of the answer referenced a way to solve my problem. I will post an answer here for my particular case using that way. — ogr
There is some good information. It also matters how many millions of rows there are and how many distinct mmsi exactly (*100K+ rows max`?). And some other details - as instructed here — Erwin Brandstetter

ogr ogr · Accepted Answer · 2019-09-17T13:35:02

Here is a quick performance comparison for the queries mention in this post.

Current setup :

The table core_message has 10,904,283 rows and there is 60,740 rows in test_boats (or 60,740 distinct mmsi in core_message).

And I'm using PostgreSQL 11.5

Query using index-only scan :

1) using DISTINCT ON :

SELECT DISTINCT ON (mmsi) mmsi 
FROM core_message;

2) using RECURSIVE with LATERAL:

WITH RECURSIVE cte AS (
   (
   SELECT mmsi
   FROM   core_message
   ORDER  BY mmsi
   LIMIT  1
   )
   UNION ALL
   SELECT m.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT mmsi
      FROM   core_message
      WHERE  mmsi > c.mmsi
      ORDER  BY mmsi
      LIMIT  1
      ) m
   )
TABLE cte;

3) Using an extra table with LATERAL:

SELECT a.mmsi
FROM test_boats a
CROSS JOIN LATERAL(
    SELECT b.time
    FROM core_message b
    WHERE a.mmsi = b.mmsi
    ORDER BY b.time DESC
    LIMIT 1
) b;

Query not using index-only scan :

4) using DISTINCT ON with mmsi,time DESC INDEX:

SELECT DISTINCT ON (mmsi) * 
FROM core_message 
ORDER BY mmsi, time desc;

5) using DISTINCT ON with backward mmsi,time UNIQUE CONSTRAINT:

SELECT DISTINCT ON (mmsi) * 
FROM core_message 
ORDER BY mmsi desc, time desc;

6) using RECURSIVE with LATERAL and mmsi,time DESC INDEX:

WITH RECURSIVE cte AS (
   (
   SELECT *
   FROM   core_message
   ORDER  BY mmsi , time DESC 
   LIMIT  1
   )
   UNION ALL
   SELECT m.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT *
      FROM   core_message
      WHERE  mmsi > c.mmsi
      ORDER  BY mmsi , time DESC 
      LIMIT  1
      ) m
   )
TABLE cte;

7) using RECURSIVE with LATERAL and backward mmsi,time UNIQUE CONSTRAINT:

WITH RECURSIVE cte AS (

   (

   SELECT *
   FROM   core_message
   ORDER  BY mmsi DESC , time DESC 
   LIMIT  1
   )
   UNION ALL
   SELECT m.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT *
      FROM   core_message
      WHERE  mmsi < c.mmsi
      ORDER  BY mmsi DESC , time DESC 
      LIMIT  1
      ) m
   )
TABLE cte;

8) Using an extra table with LATERAL:

SELECT b.*
FROM test_boats a
CROSS JOIN LATERAL(
    SELECT b.*
    FROM core_message b
    WHERE a.mmsi = b.mmsi
    ORDER BY b.time DESC
    LIMIT 1
) b;

Using a dedicated table for the last message:

9) Here is my initial solution, using a distinct table with only the last message. This table is populated as new messages arrive but could also be created like so :

CREATE TABLE core_shipinfos AS (
    WITH RECURSIVE cte AS (
       (
       SELECT *
       FROM   core_message
       ORDER  BY mmsi DESC , time DESC 
       LIMIT  1
       )
       UNION ALL
       SELECT m.*
       FROM   cte c
       CROSS  JOIN LATERAL (
          SELECT *
          FROM   core_message
          WHERE  mmsi < c.mmsi
          ORDER  BY mmsi DESC , time DESC 
          LIMIT  1
          ) m
       )
    TABLE cte);

Then the request to get the latest message is as simple as that :

SELECT * FROM core_shipinfos;

Results :

Average of multiple query (around 5 for the fast one):

1) 9146 ms
2) 728 ms
3) 498 ms

4) 51488 ms
5) 54764 ms
6) 729 ms
7) 778 ms
8) 516 ms

9) 15 ms

Conclusion:

I won't comment on the dedicated table solution, and will keep that for the end.

The additional table (test_boats) solution is definitely the winner here but the RECURSIVE solution is also pretty efficient.

There is a huge gap in performance for the DISTINCT ON using index-only scan and the one not using it but, the performance gain is rather small for the other efficient query.

This makes sense as the major improvement those queries bring is the fact that they don't need to loop over the whole core_message table but only on a subset of the unique mmsi that is significantly smaller (60K+) in comparison to the core_message table size (10M+)

As an additional note, there does not seem to be significant improvement in performance for the queries using the UNIQUE CONSTRAINT if I drop the mmsi,time DESC INDEX. But dropping that index will of course save me some space (this index currently take 328MB)

About the dedicated table solution:

Each messages stored in the core_message table carries both positional informations (position, speed, heading,etc.) AND ship informations (name, callsign, dimensions, etc.), as well as ship identifier (mmsi).

To give a bit more background on what I'm actually trying to do : I'm implementing a backend to store messages emitted by ships via the AIS protocol.

As such, every unique mmsi I got, I got it via this protocol. It is not a pre-defined list. It keeps adding new MMSI until I got every ships in the world using AIS.

In that context, a dedicated table with ship informations as last message received makes sense.

I could avoid using such a table as we've seen with the RECURSIVE solution, but... a dedicated table is still 50x faster than this RECURSIVE solution.

That dedicated table is in fact similar to the test_boat table, with more information than just the mmsi field. As it is, having a table with mmsi only field or a table with every last informations of the core_message table add the same complexity to my application.

In the end, I think I will go for this dedicated table. It will gives me unbeatable speed and I'll still have the possibility to use the LATERAL trick on core_message, which will gives me more flexibility.

Proper way to access latest row for each individual identifier?

5 Answers