1
votes

I am trying to query a redshift cluster with 2 dc2.Large nodes using the following query:

SELECT
  orders.id as order_id,
  orders.billhash as order_billhash,
  orders.shiphash as order_shiphash,
  mailings.id as mailing_id,
  mailings.addresshash,
  null as holdout_id,
  coalesce(orders.customer_id, orders.email_hash) as coal_col
FROM
  "orders"
  JOIN printer_mailings ON (
    orders.billhash = mailings.addresshash
    or orders.shiphash = mailings.addresshash
  )
WHERE
  order_date :: date between '2020-11-28' and '2021-04-27'
  AND "orders"."client_slug" = 'some client'
  AND "orders"."exclusion_reason" IS NULL
  AND "mailings"."file_id" in (select id from files where file_type = 'Mail File' and print_job_number = '12345')

The orders table has about 20 million records, and the mailings table has about 312 million. The files table however has about 5,000 so in accordance with redshift best practices I used a subquery there instead of another join. This query times out after 15 minutes in redshift whereas a query in Postgres runs in a minutes or two.

I suspect it's probably the or in the join clause or the coalesce?

EDIT: Here is the result of Explain

XN Nested Loop DS_BCAST_INNER (cost=524178.31..813478083.27 rows=5 width=387)
Join Filter: ((("inner".shiphash)::text = ("outer".addresshash)::text) OR (("inner".billhash)::text = ("outer".addresshash)::text))
-> XN Hash IN Join DS_DIST_ALL_NONE (cost=81.88..7037062.97 rows=343645 width=108)
Hash Cond: (("outer".file_id)::text = ("inner".id)::text)
-> XN Seq Scan on mailings (cost=0.00..3126019.84 rows=312601984 width=148)
-> XN Hash (cost=81.87..81.87 rows=6 width=40)
-> XN Seq Scan on slm_files (cost=0.00..81.87 rows=6 width=40)
Filter: (((print_job_number)::text = '12345'::text) AND ((file_type)::text = 'Mail File'::text))
-> XN Materialize (cost=524096.42..524099.21 rows=279 width=279)
-> XN Seq Scan on orders (cost=0.00..443744.14 rows=279 width=279)

Basically we need to see if there is a join to mailings on a billing address hash or shipping address hash

1
These tables are actually pretty huge. I'd focus instead on ways of making the query go faster, e.g. by proper indexing.Tim Biegeleisen
Redshift doesn't have indexesDBA108642
or pretty much kills performance. I would suggest that you ask a question with sample data, desired results, and an explanation of the logic you want to implement.Gordon Linoff
updated with the explain output and some logicDBA108642
@DBA108642 . . . I don't mean explain output. Ask a question with sample data, desired results, and an explanation of the logic you want to implement. I can pretty much guess what the execution plan is. That is just a symptom of the problem.Gordon Linoff

1 Answers

0
votes

XN Nested Loop DS_BCAST_INNER suggests that you are creating a Cartesian product here because of the OR logic. It's usually better to explicitly join to the other table twice rather than use a conditional in the join.

Try this:

WITH mailings_cte AS (
    SELECT mailings.addresshash
         , mailings.id 
    FROM printer_mailings
    WHERE file_id IN (SELECT id 
                      FROM files 
                      WHERE file_type = 'Mail File' 
                      AND print_job_number = '12345')
)
SELECT orders.id                                         AS order_id
     , orders.billhash                                   AS order_billhash
     , orders.shiphash                                   AS order_shiphash
     , bill_mailings.id                                  AS bill_mailing_id
     , ship_mailings.id                                  AS ship_mailing_id
     , NULL                                              AS holdout_id
     , COALESCE( orders.customer_id, orders.email_hash ) AS coal_col
FROM orders
LEFT JOIN mailings_cte bill_mailings
       ON orders.billhash = bill_mailings.addresshash 
LEFT JOIN mailings_cte ship_mailings
       ON orders.shiphash = ship_mailings.addresshash
WHERE (bill_mailings.id IS NOT NULL 
   OR  ship_mailings.id IS NOT NULL)
  AND orders.order_date::DATE BETWEEN '2020-11-28' AND '2021-04-27'
  AND orders.client_slug = 'some client'
  AND orders.exclusion_reason IS NULL