Improving join query postgresql/postgis

Question

I believe postgresql can be faster dealing with my query but every attempt to modify it makes it slower!

I have 2 tables:

statistics (id, field1, [...], field10)
geometry (id, geom)

I created indexes on :

statistics.id
geometry.id
geometry (st_x(st_centroid(st_transform(geom, 2154))), st_y(st_centroid(st_transform(geom, 2154))))

Here is the query

EXPLAIN ANALYZE SELECT 
statistics.*,
st_x(st_centroid(st_transform(geometry.geom, 2154))) AS x,
st_y(st_centroid(st_transform(geometry.geom, 2154))) AS y

FROM statistics
 JOIN geometry ON statistics.id = geometry.id 

WHERE statistics.id not like '97%';

Here is the result

Hash Join  (cost=1294.66..5158.10 rows=36593 width=342) (actual time=20.788..1085.257 rows=36552 loops=1)
Hash Cond: (geometry.id = (statistics.id)::text)
->  Seq Scan on geometry  (cost=0.00..2445.46 rows=36593 width=279) (actual time=0.010..25.271 rows=36597 loops=1)
    Filter: (id !~~ '97%'::text)
->  Hash  (cost=835.96..835.96 rows=36696 width=69) (actual time=19.892..19.892 rows=36696 loops=1)
    Buckets: 4096  Batches: 1  Memory Usage: 3780kB
    ->  Seq Scan on statistics  (cost=0.00..835.96 rows=36696 width=69) (actual time=0.005..6.871 rows=36696 loops=1)
Planning time: 0.401 ms
Execution time: 1088.612 ms

The most expensive operation is the Hash Join. How would you reorganise there query to get a better result?

Below is the schema of the tables

CREATE TABLE "statistics" (
    "REG" integer,
    "DEP" character varying(10),
    "COM" character varying(50),
    "D03" integer,
    "D04" integer,
    "D05" integer,
    "D06" integer,
    "D07" integer,
    "D08" integer,
    "D09" integer,
    "D10" integer,
    "D11" integer,
    "D12" integer,
    "D13" integer,
    "id" text
);

CREATE TABLE geometry (  
    id text NOT NULL,
    id_geo numeric(10,0),
    cm_code character varying(3),
    name character varying(50),
    status character varying(20),
    lat integer,
    long integer,
    lat_centroid integer,
    long_centroid integer,
    z_ smallint,
    area numeric(10,0),
    population double precision,
    code_ct character varying(2),
    code_r character varying(1),
    code_dp character varying(2),
    name_dp character varying(30),
    code_rg character varying(2),
    geom geometry(MultiPolygon,4326),
    x real,
    y real
);

There is about 40 000 rows in each table

The indexes have been created as follow

CREATE INDEX statistics_id_idx ON public.statistics USING btree (id COLLATE pg_catalog."default");
CREATE INDEX geometry_geom_idx ON public.geometry USING gist (geom);
CREATE INDEX geometry_id_gin2 ON public.geometry  USING gin (id COLLATE pg_catalog."default" gin_trgm_ops);

for information I have tried different indexes (btree & gin) on geometry_id & statistics_id.

Size of each table? The only problem I see is the like and geometry.id is different type of statistics.id ?? I dont understand your geometry index. for st_x. That wont help for that select. — Juan Carlos Oropeza
The tables don't have a primary key, so a hashjoin is the best you can get. (also: using a text field for an id might be sub-optimal) — wildplasser
@wildplasser there is a primary on the id of each table. However the I get the hash join. — Blue
How could I know that? You did not put that in your table definitions. BTW: for small tables there is nothing wrong with a hash join. — wildplasser

Juan Carlos Oropeza Juan Carlos Oropeza · Accepted Answer · 2015-08-26T04:40:59

I dont see any problem with your query.

Things to check

(geometry.id = (statistics.id)::text) are both field same data type?
WHERE statistics.id not like '97%'; . LIKE '%me' will never use an index, but LIKE 'me%' can possibly use an index. Why doesnt use index?
st_x(st_centroid(st_transform(geometry.geom, 2154))) AS x, is a function, and that take time. Need to transform coordinate and then extract a value. You are better if calculate that value and store it in a field.
Your geometry index doesn't have any impact on this query, because you are calculating a value not searching for something.
If you want perfom geography search that isnt the right index neither. But we can talk about that later

Things to try

First the where like.

SELECT *
FROM statistics
WHERE statistics.id not like '97%';

Then just the join

SELECT statistics.*,
       geometry.geom
FROM statistics
JOIN geometry ON statistics.id = geometry.id

Then join + st_x

SELECT statistics.*,
       st_x(st_centroid(st_transform(geometry.geom, 2154))) AS x,
       st_y(st_centroid(st_transform(geometry.geom, 2154))) AS y
FROM statistics
JOIN geometry ON statistics.id = geometry.id

Then create the precalculate x, y column in geometry table

SELECT statistics.*,
       geometry.x,
       geometry.y,
FROM statistics
JOIN geometry ON statistics.id = geometry.id

Then join + st_x + where like and join + geometry.xy + where like

Compare time between each steps to check where is taking the most time.

Improving join query postgresql/postgis

1 Answers