0
votes

I have a table with several million rows, and the query I'm running is starting to bog down (more data goes into it everyday). I'm just a lowly web dev, so I kinda muddle along when it comes to database-related tasks. I suspect that a couple of indexes will correct this, or potentially some query rewrites, but was hoping someone would point me in the right direction. Here's the query:

   SELECT 
       impressionreport.sitecode AS site, impressionreport.advertisername AS advertiser, impressionreport.mediafilename AS filename,
       SUM(impressionreport.mediafileplays) AS totalplays, SUM(impressionreport.views) AS totalviewers, SUM(impressionreport.impressions) AS totalimpressions, 
       SUM(CASE WHEN impressionreport.gender LIKE 'male' THEN impressionreport.views ELSE 0 END) AS totalmale, 
       SUM(CASE WHEN impressionreport.gender LIKE 'female' THEN impressionreport.views ELSE 0 END) AS totalfemale, 
       SUM(CASE WHEN impressionreport.gender LIKE 'female' AND agegroup.name LIKE 'young' THEN impressionreport.views ELSE 0 END) AS femaleyoung, 
       SUM(CASE WHEN impressionreport.gender LIKE 'female' AND agegroup.name LIKE 'young adult' THEN impressionreport.views ELSE 0 END) AS femaleyoungadult, 
       SUM(CASE WHEN impressionreport.gender LIKE 'female' AND agegroup.name LIKE 'adult' THEN impressionreport.views ELSE 0 END) AS femaleadult, 
       SUM(CASE WHEN impressionreport.gender LIKE 'female' AND agegroup.name LIKE 'senior' THEN impressionreport.views ELSE 0 END) AS femalesenior, 
       SUM(CASE WHEN impressionreport.gender LIKE 'male' AND agegroup.name LIKE 'young' THEN impressionreport.views ELSE 0 END) AS maleyoung, 
       SUM(CASE WHEN impressionreport.gender LIKE 'male' AND agegroup.name LIKE 'young adult' THEN impressionreport.views ELSE 0 END) AS maleyoungadult, 
       SUM(CASE WHEN impressionreport.gender LIKE 'male' AND agegroup.name LIKE 'adult' THEN impressionreport.views ELSE 0 END) AS maleadult, 
       SUM(CASE WHEN impressionreport.gender LIKE 'male' AND agegroup.name LIKE 'senior' THEN impressionreport.views ELSE 0 END) AS malesenior 
   FROM impressionreport 
   LEFT JOIN agegroup ON impressionreport.age >= agegroup.min AND impressionreport.age <= agegroup.max 
   WHERE 
       impressionreport.datelocal >= '5-1-2018' AND
       impressionreport.datelocal < '5-15-2018' AND
       impressionreport.network LIKE '%' AND
       impressionreport.sitecode LIKE '%' AND
       impressionreport.devicename LIKE '%' AND
       impressionreport.advertisername LIKE '%' AND
       impressionreport.mediafilename LIKE '%'
   GROUP BY impressionreport.sitecode, impressionreport.advertisername, impressionreport.mediafilename
   ORDER BY impressionreport.sitecode, impressionreport.advertisername, impressionreport.mediafilename

There are indexes on datelocal, devicename, sitecode, advertisername, mediafilename, gender, age, network (all btree).

EDIT:

GroupAggregate (cost=197785.58..223336.77 rows=533798 width=161) (actual time=3789.770..5819.410 rows=4577 loops=1) Group Key: impressionreport.sitecode, impressionreport.advertisername, impressionreport.mediafilename -> Sort (cost=197785.58..198469.86 rows=1368560 width=103) (actual time=3789.651..4450.374 rows=1384106 loops=1) Sort Key: impressionreport.sitecode, impressionreport.advertisername, impressionreport.mediafilename Sort Method: external merge Disk: 119504kB -> Nested Loop Left Join (cost=0.09..116428.54 rows=1368560 width=103) (actual time=0.029..1485.883 rows=1384106 loops=1) Join Filter: ((impressionreport.age >= agegroup.min) AND (impressionreport.age <= agegroup.max)) Rows Removed by Join Filter: 4885607 -> Index Scan using impressionreport_datelocal_index on impressionreport (cost=0.09..91793.44 rows=1368560 width=75) (actual time=0.020..443.316 rows=1384106 loops=1) Index Cond: ((datelocal >= '2018-05-01 00:00:00'::timestamp without time zone) AND (datelocal < '2018-05-15 00:00:00'::timestamp without time zone)) -> Materialize (cost=0.00..1.02 rows=4 width=40) (actual time=0.000..0.000 rows=4 loops=1384106) -> Seq Scan on agegroup (cost=0.00..1.01 rows=4 width=40) (actual time=0.004..0.005 rows=4 loops=1) Planning time: 0.270 ms Execution time: 5838.433 ms

2
There is an extension that called pg_trgm. It is working very well for like queries. You should try it.Yavuz Selim

2 Answers

0
votes

Try an index on impressionreport (datelocal, sitecode, advertisername, mediafilename) (compound, i.e. one index for the complete list of columns, not one for each column).

Check the execution plan, to see if the index gets picked up.


Edit:

I was wrong about the LIKEs. I missed that it could filter for datelocal before checking for the LIKEs even if they don't get optimized away.

0
votes

-> Sort (cost=197785.58..198469.86 rows=1368560 width=103) (actual time=3789.651..4450.374 rows=1384106 loops=1) Sort Key: impressionreport.sitecode, impressionreport.advertisername, impressionreport.mediafilename Sort Method: external merge Disk: 119504kB

3789.651..4450.374 means that it took 3789 ms for the first row and 4450 ms for all rows to complete that part of the query.

So the slowest part is the sorting and it's slow because it decided to do the sorting on disk. This is probably because it has to little work_mem

Try tuning postgresql.conf with the help of https://pgtune.leopard.in.ua just enter how much ram you have and hit generate