Redshift Alternative for Correlated Sub-Query

Question

I am using Redshift and need an alternative for a correlated subquery. I am getting the correlated subquery not supported error. However, for this particular exercise of trying to identify all sales transactions made by the same customer within a given hour from the originating transaction, I am not sure a traditional left join would work either. I.e., the query is dependent on the context or current value from the parent select. I have also tried something similar using row_number() window function but again, need a way to window / partition on a date range - not just customer_id.

The overall goal is to find the first sales transaction for a given customer id, then find all subsequent transactions made within 60 minutes of the first transaction. This logic will continue on for the remainder of the transactions for the same customer (and ultimately all customers in the database). That is, once the initial 60 minute window has been established from the time of the first transaction, a second 60 minute window would begin at the end of the first 60 minute window, and all transactions within the second window would also be identified and combined and then repeat for the remainder of transactions.

The output would list the first transaction id that started the 60 minute window, then the other subsequent transaction ids that were made within the 60 minute window. The 2nd row would display the first transaction id made by the same customer in the next 60 minute window (again, the first transaction post the first 60 minute window would be the start of the second 60 minute window) and then the subsequent transactions also made within the second 60 minute window.

The query example in its most basic form looks like the query below:

select
s1.customer_id,
s1.transaction_id,
s1.order_time,
(
    select
        s2.transaction_id
    from
        sales s2
    where
        s2.order_time > s1.order_time and
        s2.order_time <= dateadd(m,60,s1.order_time) and
        s2.customer_id = s1.customer_id
    order by
        s2.order_time asc
    limit 1
) as sales_transaction_id_1,
(
    select
        s3.transaction_id
    from
        sales s3
    where
        s3.order_time > s1.order_time and
        s3.order_time <= dateadd(m,60,s1.order_time) and
        s3.customer_id = s1.customer_id
    order by
        s3.order_time asc
    limit 1 offset 1
) as sales_transaction_id_2,
(
    select
        s3.transaction_id
    from
        sales s4
    where
        s4.order_time > s1.order_time and
        s4.order_time <= dateadd(m,60,s1.order_time) and
        s4.customer_id = s1.customer_id
    order by
        s4.order_time asc
    limit 1 offset 1
) as sales_transaction_id_3
from
    (
        select 
            sales.customer_id,
            sales.transaction_id,
            sales.order_time
        from
            sales
        order by
            sales.order_time desc
    ) s1;

For example, if a customer made the following transactions:

customer_id     transaction_id      order_time          
1234                33453           2017-06-05 13:30
1234                88472           2017-06-05 13:45
1234                88477           2017-06-05 14:10

1234                99321           2017-06-07 8:30
1234                99345           2017-06-07 8:45

The expected output would be as:

customer_id     transaction_id  sales_transaction_id_1 sales_transaction_id_2   sales_transaction_id_3
1234                33453           88472                   88477                   NULL
1234                99321           99345                   NULL                    NULL

Also, it appears Redshift does not support lateral joins which seems to further restrict the options at my disposal. Any help would be greatly appreciated.

Edit your question and provide sample data and desired results . . . and an explanation of what the logic is supposed to be doing. — Gordon Linoff
Do they support window-functions? CTEs? BTW: you are only selecting from s1 so the other two legs of the query can be replaced by EXISTS(...) (getting rid of the ugly LIMIT 1, too) — wildplasser
combined_transaction_id_1 is both used as alias and column which is confusing, as @GordonLinoff said please provide sample data and what exactly you want to do. — sia
Oh, it is a scalar subquery... Just join the two tables from the subqueries and add the NOT EXISTS(...) to the two ON` conditions. — wildplasser

Gordon Linoff Gordon Linoff · Accepted Answer · 2017-06-05T21:17:15

From your description, you just want group by and some sort of date difference. I'm not sure how you want to combine the rows, but here is the basic idea:

select s.customer_id,
       min(order_time) as first_order_in_hour,
       max(order_time) as last_order_in_hour,
       count(*) as num_orders
from (select s.*,
             min(order_time) over (partition by customer_id) as min_ot
      from sales s
     ) s
group by customer_id, floor(datediff(second, min_ot, order_time) / (60 * 60));

This formulation (or something similar because Postgres does not have datediff()) would also be much faster in Postgres.

Redshift Alternative for Correlated Sub-Query

2 Answers