I have 3 tables (green highlighted) and want to UNION ALL, LEADS and CONTACTS with each having 1 column not present (red highlighted). Also JOIN, CONTACTS table with ACCOUNTS to get 3 fields (so red highlighted field "code_uk" would be another column not present in LEADS after JOINING. Remaining columns I want to place one below the other (UNION ALL), as arranged in query.
Output Required:
- UNION ALL, LEADS and CONTACTS by arranging columns as done in query below - DONE
- Rows from CONTACTS where contact_email/contact_product is not present in LEADS. i.e. yellow highlighted rows 9,11,14 should not be present. - DONE
- Use STRUCT since there is mismatch in # of columns in LEADS & CONTACTS - DONE
Only unique "lead_email" i.e. blue highlighted rows 2&5, from LEADS should not be present. When I use distinct (lead_email), error is type STRUCT cannot be used in SELECT DISTINCT
- When I remove "distinct" just to test whether STRUCT works correctly or not, it outputs 15 rows but I want 13 rows (excluding 2&5).
- Tried "group by" but it is also not working
Can someone please help to fix (4)? Refer image
Query Used:
SELECT distinct(lead_email), lead_id, lead_product, struct<lead_reason string> (lead_reason) /*lead_reason*/, lead_date, lead_employee_count, lead_code,
struct<contact_cancel_date timestamp> (null) contact_cancel_date,/*null as contact_cancel_date*/ /*null as code_uk*/ struct<code_uk string> (null) code_uk FROM `sample_leads`
union all
select contact_email,contact_id, contact_product, struct<lead_reason string> (null) contact_reason, contact_date, employee,code_us,struct<contact_cancel_date timestamp> (contact_cancel_date),struct<code_uk string> (code_uk) from `sample_contacts`
left join
(
select account_id, employee, code_us, code_uk from `sample_accounts`)
on contact_id=account_id
where `sample_contacts`.contact_email NOT IN (SELECT lead_email FROM `sample_leads`)
OR `sample_contacts`.contact_product NOT IN (SELECT lead_product FROM `sample_leads`)
-- group by 1,2,3,5,6,7,8,9
-- order by lead_id
Note: "id" and "employee" columns are integer, "date" is timestamp, remaining columns are string.
Leads:
lead_id lead_email lead_product lead_reason lead_date lead_employee_count lead_code
1 [email protected] msoffice abc 2020-02-23 07:30:02 UTC 1000 1005-C
2 [email protected] chrome pqr1 2020-02-23 07:30:02 UTC 2000 2006-B
3 [email protected] iphone xyz 2020-02-23 07:30:02 UTC 3000 3007-A
4 [email protected] macbook zzz 2020-02-23 07:30:02 UTC 4000 4008-B
5 [email protected] itunes xyz1 2020-02-23 07:30:02 UTC 5000 5001-A
6 [email protected] googlecloud xyz2 2020-02-23 07:30:02 UTC 6000 6002-B
7 [email protected] yahoomail junk 2020-02-23 07:30:02 UTC 7000 7003-A
8 [email protected] 2020-02-23 07:30:02 UTC 8000 8004-B
2 [email protected] chrome pqr1 2020-02-23 07:30:02 UTC 2000 2006-B
5 [email protected] itunes xyz1 2020-02-23 07:30:02 UTC 5000 5001-A
Contacts:
contact_id contact_email contact_product contact_date contact_cancel_date
9 msoffice 2010-01-23 07:30:01 UTC 2020-02-23
07:30:02 UTC
10 [email protected] playstore 2010-01-23 07:30:01 UTC 2020-02-23 07:30:02 UTC
11 [email protected] 2010-01-23 07:30:01 UTC 2020-02-23 07:30:02 UTC
12 [email protected] 2010-01-23 07:30:01 UTC 2020-02-23 07:30:02 UTC
13 [email protected] ipod 2010-01-23 07:30:01 UTC 2020-02-23 07:30:02 UTC
14 googlecloud 2010-01-23 07:30:01 UTC 2020-02-23 07:30:02 UTC
15 [email protected] 2010-01-23 07:30:01 UTC 2020-02-23 07:30:02 UTC
16 [email protected] 2010-01-23 07:30:01 UTC 2020-02-23 07:30:02 UTC
Accounts:
account_id employee code_us code_uk
9 100 001-A 450-a
10 200 002-B 451-a
11 300 003-A 452-a
12 400 004-B 453-a
13 500 005-C 454-a
14 600 006-B 455-a
15 700 007-A 456-a
16 800 008-B 457-a