
All uuid columns below use the native Postgres uuid column type.

I have a lookup table where the uuid (version 4, so as random as can feasibly be) is the primary key. I regularly pull a sequence of rows, say 10,000, from this lookup table.

I then want to use the set of UUIDs retrieved from the lookup table to query other tables, typically two others (tables A and B). The UUIDs in tables A and B are not primary keys, but their uuid columns have UNIQUE constraints (btree indexes).

Currently I'm not doing this merging with a JOIN of any kind, just three simple queries:

  1. Query lookup table, get uuids.
  2. Query table A using uuids from (1)
  3. Query table B using uuids from (1)
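Sketched as SQL, using the table and column names from the schema dumps below (table B's name isn't shown, so step 3 is a placeholder):

```sql
-- (1) Pull a batch of UUIDs from the lookup table.
SELECT uuid FROM lookup WHERE state = 200 LIMIT 4000;

-- (2) Fetch matching rows from table A (article_data), passing the
--     UUIDs from step (1) back in as an array literal.
SELECT id, uuid, title, text
FROM article_data
WHERE uuid = ANY ('{...uuids from step 1...}'::uuid[]);

-- (3) Repeat the same ANY(...) query against table B.
```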

The issue is that queries (2) and (3) are surprisingly slow: for around 4,000 rows, particularly from table A, they typically take 30-50 seconds. Table A has around 60M rows.

Dealing with just table A: EXPLAIN ANALYZE reports an "Index Scan" on table A's uuid column, with an Index Cond in the output.

I've experimented with various WHERE clauses:

  • uuid = ANY ('{...}')
  • uuid = ANY (VALUES (...))
  • uuid = 'uuid1' OR uuid = 'uuid2' etc.

I've also experimented with the index on table A's uuid column: both a btree index and a hash index.

By far the fastest (though still relatively slow) is the btree index combined with uuid = ANY ('{...}') in the WHERE clause.

Various opinions I've read:

  • Actually doing a proper JOIN, e.g. a LEFT OUTER JOIN across the three tables.
  • That uuid version 4 is the problem: being randomly generated, rather than sequence-based, it gives the index no locality.
  • Possibly experimenting with work_mem.
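A hedged sketch of the first and third suggestions, applied to the actual tables from the schema dumps below (whether this beats the ANY('{...}') approach would need to be measured):

```sql
-- Raise work_mem for this session only, before experimenting.
SET work_mem = '256MB';

-- Do the lookup and the fetch from table A in a single join,
-- so the planner sees both tables at once.
SELECT a.id, a.uuid, a.title, a.text
FROM lookup l
JOIN article_data a ON a.uuid = l.uuid
WHERE l.state = 200
LIMIT 4000;
```

Note that the LIMIT here applies to joined rows, not to lookup rows, so it is only equivalent to the three-step version if every lookup uuid has a match in article_data.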

Anyway, I wondered if anyone else had any other suggestions?

Table: "lookup"

uuid: type uuid. not null. plain storage.
datetime_stamp: type bigint. not null. plain storage.
harvest_date_stamp: type bigint. not null. plain storage.
state: type smallint. not null. plain storage.

Indexes:
    "lookup_pkey" PRIMARY KEY, btree (uuid)
    "lookup_32ff3898" btree (datetime_stamp)
    "lookup_6c8369bc" btree (harvest_date_stamp)
    "lookup_9ed39e2e" btree (state)
Has OIDs: no

Table: "article_data"

id: type integer. not null default nextval('article_data_id_seq'::regclass). plain storage.
title: text.
text: text.
insertion_date: date.
harvest_date: timestamp with time zone.
uuid: uuid.

Indexes:
    "article_data_pkey" PRIMARY KEY, btree (id)
    "article_data_uuid_key" UNIQUE CONSTRAINT, btree (uuid)
Has OIDs: no

Both lookup and article_data have around 65M rows. Two queries:

SELECT uuid FROM lookup WHERE state = 200 LIMIT 4000;
OUTPUT FROM EXPLAIN (ANALYZE, BUFFERS):  
 Limit  (cost=0.00..4661.02 rows=4000 width=16) (actual time=0.009..1.036 rows=4000 loops=1)  
   Buffers: shared hit=42  
   ->  Seq Scan on lookup  (cost=0.00..1482857.00 rows=1272559 width=16) (actual time=0.008..0.777 rows=4000 loops=1)  
         Filter: (state = 200)  
         Rows Removed by Filter: 410  
         Buffers: shared hit=42  
 Total runtime: 1.196 ms  
(7 rows)  

Question: Why does this do a sequential scan and not an index scan when there is a btree index on state?
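With a LIMIT 4000 against an estimated 1,272,559 matching rows, a sequential scan that stops after the first 4,000 hits is often the planner's cheapest option. One way to sanity-check that choice is to disable sequential scans for the session and compare timings (a standard Postgres debugging knob, not a production setting):

```sql
-- Force the planner to prefer the btree on state, then compare
-- against the seq-scan plan above.
SET enable_seqscan = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT uuid FROM lookup WHERE state = 200 LIMIT 4000;
RESET enable_seqscan;
```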

SELECT article_data.id, article_data.uuid, article_data.title, article_data.text 
FROM article_data 
WHERE uuid = ANY ('{f0d5e665-4f21-4337-a54b-cf0b4757db65,..... 3999 more uuid's ....}'::uuid[]);
OUTPUT FROM EXPLAIN (ANALYZE, BUFFERS):  
 Index Scan using article_data_uuid_key on article_data  (cost=5.56..34277.00 rows=4000 width=581) (actual time=0.063..66029.031 rows=4000 loops=1)
   Index Cond: (uuid = ANY ('{f0d5e665-4f21-4337-a54b-cf0b4757db65,5618754f-544b-4700-9d24-c364fd0ba4e9,958e37e3-6e6e-4b2a-b854-48e88ac1fdb7,ba56b483-59b2-4ae5-ae44-910401f3221b,aa4aca60-a320-4ed3-b7b4-829e6ca63592,05f1c0b9-1f9b-4e1c-8f41-07545d694e6b,7aa4dee9-be17-49df-b0ca-d6e63b0dc023,e9037826-86c4-4bbc-a9d5-6977ff7458af,db5852bf-a447-4a1d-9673-ead2f7045589,6704d89 .......}'::uuid[]))
   Buffers: shared hit=16060 read=4084 dirtied=292
 Total runtime: 66041.443 ms
(4 rows)

Question: Why is this so slow, even though it's reading from disk?

Comment: Paste create table statements and the output from explain analyze into your question. – Mike Sherrill 'Cat Recall'

1 Answer


Without seeing your table structure and the output of explain analyze..., I'd expect an inner join on the lookup table to give the best performance. (My table_a has about 10 million rows.)

select * 
from table_a
inner join 
    -- Brain dead way to get about 1000 rows 
    -- from a renamed scratch table.
    (select test_uuid from lookup_table
     where test_id < 10000) t
on table_a.test_uuid = t.test_uuid;
"Nested Loop  (cost=0.72..8208.85 rows=972 width=36) (actual time=0.041..11.825 rows=999 loops=1)"
"  ->  Index Scan using uuid_test_2_test_id_key on lookup_table  (cost=0.29..39.30 rows=972 width=16) (actual time=0.015..0.414 rows=999 loops=1)"
"        Index Cond: (test_id < 10000)"
"  ->  Index Scan using uuid_test_test_uuid_key on table_a  (cost=0.43..8.39 rows=1 width=20) (actual time=0.010..0.011 rows=1 loops=999)"
"        Index Cond: (test_uuid = lookup_table.test_uuid)"
"Planning time: 0.503 ms"
"Execution time: 11.953 ms"