I am trying to query a redshift cluster with 2 dc2.Large nodes using the following query:
SELECT
orders.id as order_id,
orders.billhash as order_billhash,
orders.shiphash as order_shiphash,
mailings.id as mailing_id,
mailings.addresshash,
null as holdout_id,
coalesce(orders.customer_id, orders.email_hash) as coal_col
FROM
"orders"
JOIN printer_mailings ON (
orders.billhash = mailings.addresshash
or orders.shiphash = mailings.addresshash
)
WHERE
order_date :: date between '2020-11-28' and '2021-04-27'
AND "orders"."client_slug" = 'some client'
AND "orders"."exclusion_reason" IS NULL
AND "mailings"."file_id" in (select id from files where file_type = 'Mail File' and print_job_number = '12345')
The orders table has about 20 million records, and the mailings table has about 312 million. The files table however has about 5,000 so in accordance with redshift best practices I used a subquery there instead of another join. This query times out after 15 minutes in redshift whereas a query in Postgres runs in a minutes or two.
I suspect it's probably the or
in the join clause or the coalesce?
EDIT: Here is the result of Explain
XN Nested Loop DS_BCAST_INNER (cost=524178.31..813478083.27 rows=5 width=387)
Join Filter: ((("inner".shiphash)::text = ("outer".addresshash)::text) OR (("inner".billhash)::text = ("outer".addresshash)::text))
-> XN Hash IN Join DS_DIST_ALL_NONE (cost=81.88..7037062.97 rows=343645 width=108)
Hash Cond: (("outer".file_id)::text = ("inner".id)::text)
-> XN Seq Scan on mailings (cost=0.00..3126019.84 rows=312601984 width=148)
-> XN Hash (cost=81.87..81.87 rows=6 width=40)
-> XN Seq Scan on slm_files (cost=0.00..81.87 rows=6 width=40)
Filter: (((print_job_number)::text = '12345'::text) AND ((file_type)::text = 'Mail File'::text))
-> XN Materialize (cost=524096.42..524099.21 rows=279 width=279)
-> XN Seq Scan on orders (cost=0.00..443744.14 rows=279 width=279)
Basically we need to see if there is a join to mailings on a billing address hash or shipping address hash
or
pretty much kills performance. I would suggest that you ask a question with sample data, desired results, and an explanation of the logic you want to implement. – Gordon Linoffexplain
output. Ask a question with sample data, desired results, and an explanation of the logic you want to implement. I can pretty much guess what the execution plan is. That is just a symptom of the problem. – Gordon Linoff