Simple redshift select query times out after 15 minutes

Question

I am trying to query a redshift cluster with 2 dc2.Large nodes using the following query:

SELECT
  orders.id as order_id,
  orders.billhash as order_billhash,
  orders.shiphash as order_shiphash,
  mailings.id as mailing_id,
  mailings.addresshash,
  null as holdout_id,
  coalesce(orders.customer_id, orders.email_hash) as coal_col
FROM
  "orders"
  JOIN printer_mailings ON (
    orders.billhash = mailings.addresshash
    or orders.shiphash = mailings.addresshash
  )
WHERE
  order_date :: date between '2020-11-28' and '2021-04-27'
  AND "orders"."client_slug" = 'some client'
  AND "orders"."exclusion_reason" IS NULL
  AND "mailings"."file_id" in (select id from files where file_type = 'Mail File' and print_job_number = '12345')

The orders table has about 20 million records, and the mailings table has about 312 million. The files table however has about 5,000 so in accordance with redshift best practices I used a subquery there instead of another join. This query times out after 15 minutes in redshift whereas a query in Postgres runs in a minutes or two.

I suspect it's probably the or in the join clause or the coalesce?

EDIT: Here is the result of Explain

XN Nested Loop DS_BCAST_INNER (cost=524178.31..813478083.27 rows=5 width=387)
Join Filter: ((("inner".shiphash)::text = ("outer".addresshash)::text) OR (("inner".billhash)::text = ("outer".addresshash)::text))
-> XN Hash IN Join DS_DIST_ALL_NONE (cost=81.88..7037062.97 rows=343645 width=108)
Hash Cond: (("outer".file_id)::text = ("inner".id)::text)
-> XN Seq Scan on mailings (cost=0.00..3126019.84 rows=312601984 width=148)
-> XN Hash (cost=81.87..81.87 rows=6 width=40)
-> XN Seq Scan on slm_files (cost=0.00..81.87 rows=6 width=40)
Filter: (((print_job_number)::text = '12345'::text) AND ((file_type)::text = 'Mail File'::text))
-> XN Materialize (cost=524096.42..524099.21 rows=279 width=279)
-> XN Seq Scan on orders (cost=0.00..443744.14 rows=279 width=279)

Basically we need to see if there is a join to mailings on a billing address hash or shipping address hash

These tables are actually pretty huge. I'd focus instead on ways of making the query go faster, e.g. by proper indexing. — Tim Biegeleisen
or pretty much kills performance. I would suggest that you ask a question with sample data, desired results, and an explanation of the logic you want to implement. — Gordon Linoff
@DBA108642 . . . I don't mean explain output. Ask a question with sample data, desired results, and an explanation of the logic you want to implement. I can pretty much guess what the execution plan is. That is just a symptom of the problem. — Gordon Linoff

Joe Harris Joe Harris · Accepted Answer · 2021-02-18T22:02:53

XN Nested Loop DS_BCAST_INNER suggests that you are creating a Cartesian product here because of the OR logic. It's usually better to explicitly join to the other table twice rather than use a conditional in the join.

Try this:

WITH mailings_cte AS (
    SELECT mailings.addresshash
         , mailings.id 
    FROM printer_mailings
    WHERE file_id IN (SELECT id 
                      FROM files 
                      WHERE file_type = 'Mail File' 
                      AND print_job_number = '12345')
)
SELECT orders.id                                         AS order_id
     , orders.billhash                                   AS order_billhash
     , orders.shiphash                                   AS order_shiphash
     , bill_mailings.id                                  AS bill_mailing_id
     , ship_mailings.id                                  AS ship_mailing_id
     , NULL                                              AS holdout_id
     , COALESCE( orders.customer_id, orders.email_hash ) AS coal_col
FROM orders
LEFT JOIN mailings_cte bill_mailings
       ON orders.billhash = bill_mailings.addresshash 
LEFT JOIN mailings_cte ship_mailings
       ON orders.shiphash = ship_mailings.addresshash
WHERE (bill_mailings.id IS NOT NULL 
   OR  ship_mailings.id IS NOT NULL)
  AND orders.order_date::DATE BETWEEN '2020-11-28' AND '2021-04-27'
  AND orders.client_slug = 'some client'
  AND orders.exclusion_reason IS NULL

Simple redshift select query times out after 15 minutes

1 Answers