3
votes

I've written the following PostgreSQL query which works as it should. However, it seems to be awfully slow, sometimes taking up to 10 seconds to return a result. I'm sure there is something in my statement that is causing this to be slow.

Can anyone help determine why this query is slow?

SELECT DISTINCT ON (school_classes.class_id,attendance_calendar.school_date)
  school_classes.class_id, school_classes.class_name, school_classes.grade_id
, school_gradelevels.linked_calendar, attendance_calendars.calendar_id
, attendance_calendar.school_date, attendance_calendar.minutes
, teacher_join_classes_subjects.staff_id, staff.first_name, staff.last_name  

FROM school_classes 
INNER JOIN school_gradelevels ON school_gradelevels.id=school_classes.grade_id 
INNER JOIN teacher_join_classes_subjects ON teacher_join_classes_subjects.class_id=school_classes.class_id 
INNER JOIN staff ON staff.staff_id=teacher_join_classes_subjects.staff_id 
INNER JOIN attendance_calendars ON attendance_calendars.title=school_gradelevels.linked_calendar 
INNER JOIN attendance_calendar ON attendance_calendar.calendar_id=attendance_calendars.calendar_id 

WHERE teacher_join_classes_subjects.syear='2013' 
AND staff.syear='2013' 
AND attendance_calendars.syear='2013' 
AND teacher_join_classes_subjects.does_attendance='Y' 
AND teacher_join_classes_subjects.subject_id IS NULL 
AND attendance_calendar.school_date<CURRENT_DATE 

AND attendance_calendar.school_date NOT IN (

SELECT com.school_date FROM attendance_completed com
WHERE  com.class_id=school_classes.class_id
AND   (com.period_id='101' AND attendance_calendar.minutes>='151' OR
       com.period_id='95'  AND attendance_calendar.minutes='150') )

I replaced the NOT IN with the following:

AND NOT EXISTS (
    SELECT com.school_date
    FROM attendance_completed com
    WHERE com.class_id=school_classes.class_id
    AND com.school_date=attendance_calendar.school_date
    AND (com.period_id='101' AND attendance_calendar.minutes>='151' OR
         com.period_id='95'  AND attendance_calendar.minutes='150') )

Result of EXPLAIN ANALYZE:

Unique  (cost=2998.39..2998.41 rows=3 width=85) (actual time=10751.111..10751.118 rows=1 loops=1)
  ->  Sort  (cost=2998.39..2998.40 rows=3 width=85) (actual time=10751.110..10751.110 rows=2 loops=1)
        Sort Key: school_classes.class_id, attendance_calendar.school_date
        Sort Method:  quicksort  Memory: 25kB
        ->  Hash Join  (cost=2.03..2998.37 rows=3 width=85) (actual time=6409.471..10751.045 rows=2 loops=1)
              Hash Cond: ((teacher_join_classes_subjects.class_id = school_classes.class_id) AND (school_gradelevels.id = school_classes.grade_id))
              Join Filter: (NOT (SubPlan 1))
              ->  Nested Loop  (cost=0.00..120.69 rows=94 width=81) (actual time=2.468..1187.397 rows=26460 loops=1)
                    Join Filter: (attendance_calendars.calendar_id = attendance_calendar.calendar_id)
                    ->  Nested Loop  (cost=0.00..42.13 rows=1 width=70) (actual time=0.087..3.247 rows=735 loops=1)
                          Join Filter: ((attendance_calendars.title)::text = (school_gradelevels.linked_calendar)::text)
                          ->  Nested Loop  (cost=0.00..40.80 rows=1 width=277) (actual time=0.077..1.005 rows=245 loops=1)
                                ->  Nested Loop  (cost=0.00..39.61 rows=1 width=27) (actual time=0.064..0.572 rows=49 loops=1)
                                      ->  Seq Scan on teacher_join_classes_subjects  (cost=0.00..10.48 rows=4 width=14) (actual time=0.022..0.143 rows=49 loops=1)
                                            Filter: ((subject_id IS NULL) AND (syear = 2013::numeric) AND ((does_attendance)::text = 'Y'::text))
                                      ->  Index Scan using staff_pkey on staff  (cost=0.00..7.27 rows=1 width=20) (actual time=0.006..0.007 rows=1 loops=49)
                                            Index Cond: (staff.staff_id = teacher_join_classes_subjects.staff_id)
                                            Filter: (staff.syear = 2013::numeric)
                                ->  Seq Scan on attendance_calendars  (cost=0.00..1.18 rows=1 width=250) (actual time=0.003..0.006 rows=5 loops=49)
                                      Filter: (attendance_calendars.syear = 2013::numeric)
                          ->  Seq Scan on school_gradelevels  (cost=0.00..1.15 rows=15 width=11) (actual time=0.001..0.005 rows=15 loops=245)
                    ->  Seq Scan on attendance_calendar  (cost=0.00..55.26 rows=1864 width=18) (actual time=0.003..1.129 rows=1824 loops=735)
                          Filter: (attendance_calendar.school_date   Hash  (cost=1.41..1.41 rows=41 width=18) (actual time=0.040..0.040 rows=41 loops=1)
                    ->  Seq Scan on school_classes  (cost=0.00..1.41 rows=41 width=18) (actual time=0.006..0.015 rows=41 loops=1)
              SubPlan 1
                ->  Seq Scan on attendance_completed com  (cost=0.00..958.28 rows=5 width=4) (actual time=0.228..5.411 rows=17 loops=1764)
                      Filter: ((class_id = $0) AND (((period_id = 101::numeric) AND ($1 >= 151::numeric)) OR ((period_id = 95::numeric) AND ($1 = 150::numeric))))
1
instead of NOT IN, if I DO AND NOT EXISTS, then the entire thing runs super fast, so I am assuming something wrong in the NOT IN statement. Any advice?John Smith
I've solved the problem by using the NOT EXISTS instead of using NOT IN. It's now become super fast.John Smith
Do you really get the same results? I belive NOT EXISTS just checks if the "inner" query returns any rows. Just changing NOT IN to NOT EXISTS in your query should actually not work due to syntax error. Could you paste the result of an EXPLAIN ANALYZE on your original query?Petter Engström
Thanks for your reply Petter, i've updated it with the EXPLAIN ANALYZE result. and also included the NOT EXISTS statement that seems to be helping.John Smith

1 Answers

2
votes

NOT EXISTS is an excellent choice. Almost always better than NOT IN. More details here. I simplified your query a bit (which looks fine, generally):

SELECT DISTINCT ON (c.class_id, a.school_date)
       c.class_id, c.class_name, c.grade_id
      ,g.linked_calendar, aa.calendar_id
      ,a.school_date, a.minutes
      ,t.staff_id, s.first_name, s.last_name  
FROM   school_classes                c
JOIN   teacher_join_classes_subjects t  USING (class_id)
JOIN   staff                         s  USING (staff_id)
JOIN   school_gradelevels            g  ON g.id = c.grade_id 
JOIN   attendance_calendars          aa ON aa.title = g.linked_calendar 
JOIN   attendance_calendar           a  ON a.calendar_id = aa.calendar_id 
WHERE  t.syear = 2013
AND    s.syear = 2013
AND    aa.syear = 2013
AND    t.does_attendance = 'Y'   -- looks like it should be boolean!
AND    t.subject_id IS NULL 
AND    a.school_date < CURRENT_DATE 
AND NOT EXISTS (
   SELECT 1
   FROM   attendance_completed x
   WHERE  x.class_id = c.class_id
   AND    x.school_date = a.school_date
   AND   (x.period_id = 101 AND a.minutes >= 151 OR  -- actually numbers?
          x.period_id =  95 AND a.minutes  = 150)
   )
ORDER BY c.class_id, a.school_date, ???

What seems to be missing is ORDER BY which should accompany your DISTINCT ON. Add more ORDER BY items in place of ???. If there are duplicates to pick from, you probably want to define which to pick.

Numeric literals don't need single quotes and boolean values should be coded as such.
You may want to revisit the chapter about data types.