I am new to PostgreSQL and PostGIS but the question is not trivial. I am using PostgreSQL 9.5 with PostGIS 2.2.
I need to run some queries that take a horrible amount of time.
First, let me explain the problem in non-GIS terms :
Basically, I have a set of several hundreds of thousands of points spread over a territory of about half a million square kilometres a (country).
Over this territory, I have about a dozen sets of areas coming from various databases. In each set, I have between a few hundreds and a few thousands of areas. I want to find which points are in any of these areas.
Now, how I am currently working out the problem in GIS terms :
Each set of areas is a Postgresql table with a geometry column of the type multipolygon and with, as explained before a few hundreds to a few thousand records.
All these tables are contained in a schema donnees but I am using a different schema for these operations, called traitements.
So the process is a/ merging all the geometries into a single geometry, and then b/ finding which points are contained in this geometry.
The problem is that, if step a/ took a reasonable amount of time (several minutes), step b/ takes forever.
I am currently working with only a sample of the points I must process (about 1% of them, i.e. about 7000) and it is not finished after several hours (the database connection eventually times out).
I am making tests running the query by limiting the number of return rows to 10 or 50 and it still takes about half an hour for that.
I am using a Linux Mint 18 machine with 4 CPU and 8 Gb of RAM if you wonder.
I have created indexes on the geometry columns. All geometry columns use the same SRID.
Creating the tables :
CREATE TABLE traitements.sites_candidats (
pkid serial PRIMARY KEY,
statut varchar(255) NOT NULL,
geom geometry(Point, 2154)
);
CREATE UNIQUE INDEX ON traitements.sites_candidats (origine, origine_id ) ;
CREATE INDEX ON traitements.sites_candidats (statut);
CREATE INDEX sites_candidats_geométrie ON traitements.sites_candidats USING GIST ( geom );
CREATE TABLE traitements.zones_traitements (
pkid serial PRIMARY KEY,
définition varchar(255) NOT NULL,
geom geometry (MultiPolygon, 2154)
);
CREATE UNIQUE INDEX ON traitements.zones_traitements (définition) ;
CREATE INDEX zones_traitements_geométrie ON traitements.zones_traitements USING GIST ( geom );
Please note that I specified the geometry type of the geom column in table traitements only because I wanted to specify the SRID but I was not sure what is the correct syntax for any type of Geometry. Maybe "geom geometry (Geometry, 2154)" ?
Merging all the geometries of the various sets of areas :
As said before, all the tables hold geometries of the type multipolygon.
This is the code I am using to merge all the geometries from one of the tables :
INSERT INTO traitements.zones_traitements
( définition, , geom )
VALUES
(
'first-level merge',
(
SELECT ST_Multi(ST_Collect(dumpedGeometries)) AS singleMultiGeometry
FROM
(
SELECT ST_Force2D((ST_Dump(geom)).geom) AS dumpedGeometries
FROM donnees.one_table
) AS dumpingGeometries
)
) ;
I found that some of the geometries in some of the records are in 3D, so that's why I am using _ST_Force2D_.
I do this for all the tables and then merge the geometries again using :
INSERT INTO traitements.zones_traitements
( définition, geom )
VALUES
(
'second-level merge',
(
SELECT ST_Multi(ST_Collect(dumpedGeometries)) AS singleMultiGeometry
FROM
(
SELECT (ST_Dump(geom)).geom AS dumpedGeometries
FROM traitements.zones_traitements
WHERE définition != 'second-level merge'
) AS dumpingGeometries
)
) ;
As said before, these queries take several minutes but that's fine.
Not the query that takes forever :
SELECT pkid
FROM traitements.sites_candidats AS sites
JOIN (
SELECT geom FROM traitements.zones_traitements
WHERE définition = 'zones_rédhibitoires' ) AS zones
ON ST_Contains(zones.geom , sites.geom)
LIMIT 50;
Analysing the problem :
Obviously, it is the subquery selecting the points that takes a lot of time, not the update.
So I have run an EXPLAIN (ANALYZE, BUFFERS) on the query :
EXPLAIN (ANALYZE, BUFFERS)
SELECT pkid
FROM traitements.sites_candidats AS sites
JOIN (
SELECT geom FROM traitements.zones_traitements
WHERE définition = 'second_level_merge' ) AS zones
ON ST_Contains(zones.geom , sites.geom)
LIMIT 10;
---------------------------------
"Limit (cost=4.18..20.23 rows=1 width=22) (actual time=6052.069..4393634.244 rows=10 loops=1)"
" Buffers: shared hit=1 read=688784"
" -> Nested Loop (cost=4.18..20.23 rows=1 width=22) (actual time=6052.068..4391938.803 rows=10 loops=1)"
" Buffers: shared hit=1 read=688784"
" -> Seq Scan on zones_traitements (cost=0.00..1.23 rows=1 width=54939392) (actual time=0.016..0.016 rows=1 loops=1)"
" Filter: (("définition")::text = 'zones_rédhibitoires'::text)"
" Rows Removed by Filter: 17"
" Buffers: shared hit=1"
" -> Bitmap Heap Scan on sites_candidats sites (cost=4.18..19.00 rows=1 width=54) (actual time=6052.044..4391260.053 rows=10 loops=1)"
" Recheck Cond: (zones_traitements.geom ~ geom)"
" Filter: _st_contains(zones_traitements.geom, geom)"
" Heap Blocks: exact=1"
" Buffers: shared read=688784"
" -> Bitmap Index Scan on "sites_candidats_geométrie" (cost=0.00..4.18 rows=4 width=0) (actual time=23.284..23.284 rows=3720 loops=1)"
" Index Cond: (zones_traitements.geom ~ geom)"
" Buffers: shared read=51"
"Planning time: 91.967 ms"
"Execution time: 4399271.394 ms"
I am not sure how to read this output.
Nevertheless, I suspect that the query is so slow because of the geometry obtained by merging all these multipolygons into a single one.
Questions :
Would that work better using a different type of geometry to merge the others, like a GeometryCollection ?
How does the indexes work in this case ?
Is there more efficient than ST_Contains() ?