All uuid columns below use the native Postgres uuid column type.
I have a lookup table where the uuid (UUID version 4, so effectively as random as can feasibly be) is the primary key. I regularly pull a sequence of rows, say 10,000, from this lookup table.
I then want to use that set of UUIDs to query other tables, typically two others (tables A and B). The UUIDs in tables A and B are not primary keys, but their UUID columns have UNIQUE constraints (btree indices).
Currently I'm not doing this merging with a JOIN of any kind, just three simple queries (sketched below):
1. Query the lookup table and get the UUIDs.
2. Query table A using the UUIDs from (1).
3. Query table B using the UUIDs from (1).
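For concreteness, the three steps look roughly like this (a sketch only; table and column names follow the definitions further down, and table_b is a placeholder for table B, whose definition isn't shown):

-- (1) Pull a batch of UUIDs from the lookup table.
SELECT uuid FROM lookup WHERE state = 200 LIMIT 4000;

-- (2) Fetch the matching rows from table A, passing the UUIDs from (1)
--     back in, e.g. as an array literal (two shown here instead of ~4,000).
SELECT id, uuid, title, text
FROM article_data
WHERE uuid = ANY ('{f0d5e665-4f21-4337-a54b-cf0b4757db65,5618754f-544b-4700-9d24-c364fd0ba4e9}'::uuid[]);

-- (3) The same again for table B.
SELECT *
FROM table_b
WHERE uuid = ANY ('{f0d5e665-4f21-4337-a54b-cf0b4757db65,5618754f-544b-4700-9d24-c364fd0ba4e9}'::uuid[]);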
The issue is that queries (2) and (3) are surprisingly slow: retrieving around 4,000 rows from tables A and B, particularly from table A, typically takes 30-50 seconds. Table A has around 60M rows.
Looking at just table A, EXPLAIN ANALYZE reports an "Index Scan" on the uuid column of table A, with an Index Cond in the output.
I've experimented with various WHERE clauses (spelled out after this list):
- uuid = ANY ('{...}')
- uuid = ANY (VALUES (...), ...)
- uuid = 'uuid1' OR uuid = 'uuid2' etc.
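In full, the second and third variants look roughly like this (sketches only; two UUIDs taken from the EXPLAIN output below stand in for the ~4,000 actually passed, and the array-literal form appears in the full query further down):

-- Variant 2: pass the UUIDs as a VALUES list.
SELECT id, uuid, title, text
FROM article_data
WHERE uuid = ANY (VALUES ('f0d5e665-4f21-4337-a54b-cf0b4757db65'::uuid),
                         ('5618754f-544b-4700-9d24-c364fd0ba4e9'::uuid));

-- Variant 3: chain the UUIDs together with OR.
SELECT id, uuid, title, text
FROM article_data
WHERE uuid = 'f0d5e665-4f21-4337-a54b-cf0b4757db65'
   OR uuid = '5618754f-544b-4700-9d24-c364fd0ba4e9';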
I've also experimented with the index on table A's uuid column: a separate (distinct) btree index and a hash index, in addition to the existing unique btree. By far the fastest combination (which is still relatively slow) is the btree index with "= ANY ('{...}')" in the WHERE clause.
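For reference, the hash-index experiment was along these lines (the index name here is made up; the existing unique btree is article_data_uuid_key). Note that on older Postgres releases, which the "Total runtime" line in the EXPLAIN output below suggests, hash indexes were not WAL-logged:

-- Hypothetical extra index for comparison against the unique btree.
CREATE INDEX article_data_uuid_hash ON article_data USING hash (uuid);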
Various opinions I've read:
- Actually doing a proper JOIN, e.g. a LEFT OUTER JOIN across the three tables (see the sketch after this list).
- That the use of UUID version 4 is the problem, since it is a randomly generated id rather than a sequence-based one.
- Possibly experimenting with work_mem.
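For reference, a sketch of what the first and third suggestions might look like (table_b is again a placeholder for table B, and the work_mem value is purely illustrative):

-- JOIN variant: fold the three queries into one, keeping the lookup rows
-- even where table A or table B has no matching row.
SELECT l.uuid, a.id, a.title, a.text, b.*
FROM (SELECT uuid FROM lookup WHERE state = 200 LIMIT 4000) AS l
LEFT OUTER JOIN article_data AS a ON a.uuid = l.uuid
LEFT OUTER JOIN table_b      AS b ON b.uuid = l.uuid;

-- work_mem variant: raise it for the current session before running the queries.
SET work_mem = '256MB';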
Anyway, I wondered if anyone else had any other suggestions?
Table: "lookup" uuid: type uuid. not null. plain storage. datetime_stamp: type bigint. not null. plain storage. harvest_date_stamp: type bigint. not null. plain storage. state: type smallint. not null. plain storage. Indexes: "lookup_pkey" PRIMARY KEY, btree (uuid) "lookup_32ff3898" btree (datetime_stamp) "lookup_6c8369bc" btree (harvest_date_stamp) "lookup_9ed39e2e" btree (state) Has OIDs: no Table: "article_data"` int: type integer. not null default nextval('article_data_id_seq'::regclass). plain storage. title: text. text: text. insertion_date: date harvest_date: timestamp with time zone. uuid: uuid. Indexes: "article_data_pkey" PRIMARY KEY, btree (id) "article_data_uuid_key" UNIQUE CONSTRAINT, btree (uuid) Has OIDs: no
Both lookup and article_data have around 65m rows. Two queries:
SELECT uuid FROM lookup WHERE state = 200 LIMIT 4000;
OUTPUT FROM EXPLAIN (ANALYZE, BUFFERS):

Limit  (cost=0.00..4661.02 rows=4000 width=16) (actual time=0.009..1.036 rows=4000 loops=1)
  Buffers: shared hit=42
  ->  Seq Scan on lookup  (cost=0.00..1482857.00 rows=1272559 width=16) (actual time=0.008..0.777 rows=4000 loops=1)
        Filter: (state = 200)
        Rows Removed by Filter: 410
        Buffers: shared hit=42
Total runtime: 1.196 ms
(7 rows)
Question: Why does this do a sequential scan and not an index scan when there is a btree index on state?
SELECT article_data.id, article_data.uuid, article_data.title, article_data.text
FROM article_data
WHERE uuid = ANY ('{f0d5e665-4f21-4337-a54b-cf0b4757db65, ..... 3,999 more UUIDs ....}'::uuid[]);
OUTPUT FROM EXPLAIN (ANALYZE, BUFFERS):

Index Scan using article_data_uuid_key on article_data  (cost=5.56..34277.00 rows=4000 width=581) (actual time=0.063..66029.031 rows=4000 loops=1)
  Index Cond: (uuid = ANY ('{f0d5e665-4f21-4337-a54b-cf0b4757db65,5618754f-544b-4700-9d24-c364fd0ba4e9,958e37e3-6e6e-4b2a-b854-48e88ac1fdb7,ba56b483-59b2-4ae5-ae44-910401f3221b,aa4aca60-a320-4ed3-b7b4-829e6ca63592,05f1c0b9-1f9b-4e1c-8f41-07545d694e6b,7aa4dee9-be17-49df-b0ca-d6e63b0dc023,e9037826-86c4-4bbc-a9d5-6977ff7458af,db5852bf-a447-4a1d-9673-ead2f7045589,6704d89.......}'::uuid[]))
  Buffers: shared hit=16060 read=4084 dirtied=292
Total runtime: 66041.443 ms
(4 rows)
Question: Why is this so slow, even allowing for the fact that it has to read some of the data from disk?