1
votes

I am trying to build a table that would bring be a combination of all products that I could sell, based on the current ones.

Product Status Table

+-------------+--------------+----------------+
| customer_id | product_name | product_status |
+-------------+--------------+----------------+
|           1 | A            | Active         |
|           2 | B            | Active         |
|           2 | C            | Active         |
|           3 | A            | Cancelled      |
+-------------+--------------+----------------+

Now I am trying to cross join with a hard code table that would give be 4 rows per customer_id, based on all 4 product we have in our portfolio, and statuses that I would like to apply.

Portfolio Table


+--------------+------------+----------+
| product_name |  status_1  | status_2 |
+--------------+------------+----------+
| A            | Inelegible | Inactive |
| B            | Inelegible | Inactive |
| C            | Ineligible | Inactive |
| D            | Inelegible | Inactive |
+--------------+------------+----------+


On my code I tried to use a CROSS JOIN in order to achieve 4 rows per customer_id. Unfortunately, for customers that have more than one product, I have double/triple rows.

This is my code:

SELECT
    p.customer_id,
    CASE WHEN p.product_name = pt.product_name THEN p.product_name ELSE pt.product_name END AS product_name,
    CASE 
        WHEN p.product_name = pt.product_name THEN p.product_status 
        ELSE pt.status_1
    END AS product_status
FROM 
    products AS p
CROSS JOIN
    portfolio as pt

This is my current output:


+----+-------------+--------------+----------------+
| #  | customer_id | product_name | product_status |
+----+-------------+--------------+----------------+
|  1 |           1 | A            | Active         |
|  2 |           1 | B            | Inelegible     |
|  3 |           1 | C            | Inelegible     |
|  4 |           1 | D            | Inelegible     |
|  5 |           2 | A            | Ineligible     |
|  6 |           2 | A            | Ineligible     |
|  7 |           2 | B            | Active         |
|  8 |           2 | B            | Ineligible     |
|  9 |           2 | C            | Active         |
| 10 |           2 | C            | Ineligible     |
| 11 |           2 | D            | Ineligible     |
| 12 |           2 | D            | Ineligible     |
| 13 |           3 | A            | Cancelled      |
| 14 |           3 | B            | Ineligible     |
| 15 |           3 | C            | Ineligible     |
| 16 |           3 | D            | Ineligible     |
+----+-------------+--------------+----------------+

As you may see, for the customer_id 2, I have two rows for each product having products B and C with different statuses then what I have on the product_status table.

What I would like to achieve, in this case, is a table with 12 rows, in which the current product/status from the product_status table is shown, and the remaining product/statuses from the portfolio table are added.

Expected output


+----+-------------+--------------+----------------+
| #  | customer_id | product_name | product_status |
+----+-------------+--------------+----------------+
|  1 |           1 | A            | Active         |
|  2 |           1 | B            | Inelegible     |
|  3 |           1 | C            | Inelegible     |
|  4 |           1 | D            | Inelegible     |
|  5 |           2 | A            | Ineligible     |
|  6 |           2 | B            | Active         |
|  7 |           2 | C            | Active         |
|  8 |           2 | D            | Ineligible     |
|  9 |           3 | A            | Cancelled      |
| 10 |           3 | B            | Ineligible     |
| 11 |           3 | C            | Ineligible     |
| 12 |           3 | D            | Ineligible     |
+----+-------------+--------------+----------------+

Not sure if the CROSS JOIN is the best alternative, but now I am running out of ideas.

1
expected output ??Srinivas
@Srinivas done!Heber Brandao
@Srinivas I am using pure hive. not sparkHeber Brandao

1 Answers

0
votes

EDIT:

I thought of another cleaner solution. Do a cross join first, then a right join on the customer_id and product_name, and coalesce the product statuses.

SELECT customer_id, product_name, coalesce(product_status, status_1)
FROM products p
RIGHT JOIN (
    SELECT * 
    FROM (SELECT DISTINCT customer_id FROM products) pro
    CROSS JOIN portfolio
) pt
USING (customer_id, product_name)
ORDER BY customer_id, product_name

Old answer: The idea is to include information of all product names for a customer_id into a list, and check whether the product in portfolio is in that list.

(SELECT customer_id, pt_product_name as product_name, first(status_1) as product_status
FROM (
    SELECT
        customer_id,
        p.product_name as p_product_name,
        pt.product_name as pt_product_name,
        product_status,
        status_1,
        status_2,
        collect_list(p.product_name) over (partition by customer_id) AS product_list
    FROM products p
    CROSS JOIN portfolio pt
    )
WHERE NOT array_contains(product_list, pt_product_name)
GROUP BY customer_id, product_name)

UNION ALL

(SELECT customer_id, p_product_name as product_name, first(product_status) as product_status
FROM (
    SELECT
        customer_id,
        p.product_name as p_product_name,
        pt.product_name as pt_product_name,
        product_status,
        status_1,
        status_2,
        collect_list(p.product_name) over (partition by customer_id) AS product_list 
    FROM products p
    CROSS JOIN portfolio pt)
WHERE array_contains(product_list, pt_product_name)
GROUP BY customer_id, product_name)

ORDER BY customer_id, product_name;

which gives

+-----------+------------+--------------+
|customer_id|product_name|product_status|
+-----------+------------+--------------+
|          1|           A|        Active|
|          1|           B|    Inelegible|
|          1|           C|    Ineligible|
|          1|           D|    Inelegible|
|          2|           A|    Inelegible|
|          2|           B|        Active|
|          2|           C|        Active|
|          2|           D|    Inelegible|
|          3|           A|     Cancelled|
|          3|           B|    Inelegible|
|          3|           C|    Ineligible|
|          3|           D|    Inelegible|
+-----------+------------+--------------+

FYI the chunk before UNION ALL gives:

+-----------+------------+--------------+
|customer_id|product_name|product_status|
+-----------+------------+--------------+
|          1|           B|    Inelegible|
|          1|           C|    Ineligible|
|          1|           D|    Inelegible|
|          2|           A|    Inelegible|
|          2|           D|    Inelegible|
|          3|           B|    Inelegible|
|          3|           C|    Ineligible|
|          3|           D|    Inelegible|
+-----------+------------+--------------+

And the chunk after UNION ALL gives:

+-----------+------------+--------------+
|customer_id|product_name|product_status|
+-----------+------------+--------------+
|          1|           A|        Active|
|          2|           B|        Active|
|          2|           C|        Active|
|          3|           A|     Cancelled|
+-----------+------------+--------------+

Hope that helps!