
I have a PostgreSQL 13 database with the following data:

  • 1 materialized view mat_view: 79 rows; columns: id (bigint), start_ts (double precision), end_ts (double precision)
  • 1 table time_tbl: 320 million rows, 9 columns; relevant columns: gen_ts (double precision) and time_id (bigint, primary key); btree indexes on gen_ts and time_id
  • 1 table v_tbl: 320 million rows, 49 columns; relevant column: _fk_time_id (bigint, foreign key to time_tbl.time_id)

The following query, which uses fixed timestamps (1500 and 2772), takes around 800 ms:

WITH d AS (
    SELECT * FROM time_tbl
    WHERE gen_ts BETWEEN 1500 AND 2772
)
SELECT * FROM d JOIN v_tbl ON d.time_id = _fk_time_id;

The output of EXPLAIN ANALYZE is:

Nested Loop Inner Join
 - Index Scan on time_tbl   
 - Index Scan on v_tbl      

In the example above I use fixed timestamps. When I change the query to the following:

WITH ts AS (
    SELECT start_ts, end_ts
    FROM mat_view
    WHERE id = 60
), d AS (
    SELECT * FROM time_tbl, ts
    WHERE gen_ts BETWEEN ts.start_ts AND ts.end_ts
)
SELECT * FROM d JOIN v_tbl ON d.time_id = _fk_time_id;

It takes more than 50 seconds to complete, even though the time frame is the same.

The output of EXPLAIN ANALYZE is:

Merge Inner Join 
 - Index Scan using _fk_time_id_idx on v_tbl (major costs!)
 - Materialize
   - Sort
     - Nested Loop Inner Join
       - Seq Scan on mat_view (minor costs)
       - Index Scan on time_tbl

Does anyone have an idea how I can speed up this query?


EDIT 1

Sorry, Laurenz. EXPLAIN (ANALYZE, BUFFERS) returns:

Merge Join  (cost=10305331.73..26533267.34 rows=36225336 width=286) (actual time=66017.916..66028.416 rows=12747 loops=1)
  Merge Cond: (v_tbl._fk_time_id = time_tbl.time_id)
  Buffers: shared hit=40 read=4474243 written=1
  ->  Index Scan using v_tbl_fk_time_id_idx on v_tbl  (cost=0.57..14793820.93 rows=326028024 width=222) (actual time=0.164..52890.886 rows=244510550 loops=1)
        Buffers: shared hit=4 read=4474084 written=1
  ->  Materialize  (cost=10303264.73..10481232.72 rows=35593597 width=64) (actual time=6.406..8.218 rows=12747 loops=1)
        Buffers: shared hit=36 read=159
        ->  Sort  (cost=10303264.73..10392248.73 rows=35593597 width=64) (actual time=6.402..6.914 rows=12747 loops=1)
              Sort Key: time_tbl.time_id
              Sort Method: quicksort  Memory: 2177kB
              Buffers: shared hit=36 read=159
              ->  Nested Loop  (cost=0.57..1824200.47 rows=35593597 width=64) (actual time=0.395..4.310 rows=12747 loops=1)
                    Buffers: shared hit=36 read=159
                    ->  Seq Scan on mat_view  (cost=0.00..1.99 rows=1 width=16) (actual time=0.016..0.017 rows=1 loops=1)
                          Filter: (id = 60)
                          Rows Removed by Filter: 78
                          Buffers: shared hit=1
                    ->  Index Scan using gen_ts_idx on time_tbl  (cost=0.57..1468262.51 rows=35593597 width=48) (actual time=0.372..3.205 rows=12747 loops=1)
                          Index Cond: ((gen_ts >= mat_view.start_ts) AND (gen_ts <= mat_view.end_ts))
                          Buffers: shared hit=35 read=159
Planning:
  Buffers: shared hit=12 read=8 dirtied=2
Planning Time: 0.980 ms
JIT:
  Functions: 16
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 2.298 ms, Inlining 5.521 ms, Optimization 101.687 ms, Emission 58.024 ms, Total 167.530 ms
Execution Time: 66031.189 ms
How many records does CTE d have? If it has hundreds of thousands of records, you can instead create a table called temp_d and insert the data into it. Put an index on time_id and try the query again (see the sketch after these comments). See if that makes any improvement. What's the acceptable worst-case time for a query like this in your mind? 20s? - zedfoxus
The CTE has only 150 entries. I am confused because querying the timestamps from d takes 70 ms and the first query with fixed timestamps takes around 800 ms, but the combination of the two takes so much longer (50 sec). - pio
"gen_ts (double)" Why is this a double if it is supposed to represent a timestamp? - wildplasser
If you paraphrase the execution plan, that isn't much help for us. Please provide a literal copy of the EXPLAIN (ANALYZE, BUFFERS) output. - Laurenz Albe
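
For reference, here is a minimal sketch of the temp-table approach zedfoxus suggests. The ANALYZE step is an addition so the planner sees the real row count; treat the whole thing as untested against this schema:

-- Materialize the CTE result into a temporary table so the planner
-- works with real rows instead of a misestimated range condition.
CREATE TEMP TABLE temp_d AS
SELECT t.*
FROM mat_view m
JOIN time_tbl t ON t.gen_ts BETWEEN m.start_ts AND m.end_ts
WHERE m.id = 60;

CREATE INDEX ON temp_d (time_id);
ANALYZE temp_d;  -- refresh statistics so the planner sees ~150 rows

SELECT * FROM temp_d d JOIN v_tbl ON d.time_id = v_tbl._fk_time_id;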

1 Answer


My best bet for the performance problem would be the CTE. I have run into performance issues with them myself. This is a nice article that explains the problem: https://hakibenita.medium.com/be-careful-with-cte-in-postgresql-fca5e24d2119

Could you use subqueries instead?

SELECT
    *
FROM
    time_tbl d
    JOIN v_tbl ON d.time_id = _fk_time_id
WHERE
    gen_ts BETWEEN (SELECT start_ts FROM mat_view WHERE id = 60 LIMIT 1)
               AND (SELECT end_ts FROM mat_view WHERE id = 60 LIMIT 1)

There may well be logical and/or syntax errors in the query, but if you can write a query without the CTE, it would be worth testing.
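
Another option worth testing, if an extra round trip is acceptable: fetch the bounds first and then run the main query with literal values, which is effectively what the fast 800 ms query does. With literals the planner can use the statistics on gen_ts instead of guessing at the range. A sketch, where the values 1500 and 2772 stand in for whatever step 1 returns:

-- Step 1: look up the bounds (cheap, one row).
SELECT start_ts, end_ts FROM mat_view WHERE id = 60;

-- Step 2: substitute the returned values as literals, e.g. 1500 and 2772.
-- The planner can now estimate the row count from the gen_ts histogram
-- instead of assuming a huge range, and should pick the nested loop plan.
SELECT *
FROM time_tbl d
JOIN v_tbl ON v_tbl._fk_time_id = d.time_id
WHERE d.gen_ts BETWEEN 1500 AND 2772;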