
I'm trying to run a big query that involves joining 5 tables. The query needs to run every half hour; most of the time it completes in a matter of seconds, but sometimes it takes hours, and in the worst case it fails at the end with the error "Connection error, please try again". I'm using "join" rather than "join each" for performance reasons, and the underlying dataset doesn't seem too big for now.

The query is:

select 
    '2014-10-13' as date,
    i.eventId as event_id,
    e.event_name as event_name,
    datediff('2014-10-13', e.event_start_date)+1 as event_day,
    i.impressionCts as total_viewers,
    round(i.top10Cts*100/i.impressionCts) as percent_pos_top10,
    round(i.ct1020*100/i.impressionCts) as percent_pos_1020,
    round(i.ct2030*100/i.impressionCts) as percent_pos_2030,
    round(i.ct3050*100/i.impressionCts) as percent_pos_3050,
    round(i.over50Cts*100/i.impressionCts) as percent_pos_50above,

    p.purchasers as purchasers,
    round(p.purchasers*100/i.impressionCts) as conversion_rate_view_to_purchase,
    p.total_units as total_purchased_units,
    round(p.total_demand,2) as total_purchased_demand,

    ck.total_clickiers as total_clickers,
    ck.total_clicks as total_clicks,
    round(ck.total_clickiers*100/i.impressionCts) as conversion_rate_view_to_click,
    round(ck.clickers_pos_top10*100/i.top10Cts) as conversion_rate_view_to_click_pos_top10,
    round(ck.clickers_pos_1020*100/i.ct1020) as conversion_rate_view_to_click_pos_1020,
    round(ck.clickers_pos_2030*100/i.ct2030) as conversion_rate_view_to_click_pos_2030,
    round(ck.clickers_pos_3050*100/i.ct3050) as conversion_rate_view_to_click_pos_3050,
    round(ck.clickers_pos_above50Cts*100/i.over50Cts) as conversion_rate_view_to_click_pos_50above,

    ca.cartadders as total_cartadders,
    ca.qty as total_qty_addedToCart,
    round(ca.cartadders*100/ck.total_clickiers) as conversion_rate_click_to_cartadd,

    round(p.purchasers*100/ca.cartadders) as conversion_rate_cartadd_to_purchase
from
(select integer(eventId) as eventId,count(distinct customerId) as impressionCts, 
        count(distinct if(position < 10, customerId,null) ) as top10Cts,
        count(distinct if(position between 10 and 19, customerId,null)) as ct1020,  
        count(distinct if(position between 20 and 29, customerId,null)) as ct2030,
        count(distinct if(position between 30 and 49, customerId,null)) as ct3050,
        count(distinct if(position >= 50, customerId,null)) as over50Cts
from clickstream.event_impression_20141013 group each by eventId) i
join 
(select magenta_event_id,event_name,event_start_date,event_end_date from zudw.event where date(event_start_date)<='2014-10-13' and  date(event_end_date)>='2014-10-13') e
on e.magenta_event_id = i.eventId
left join 
(select 
        integer(eventId) as eventId,count(distinct customerId) as total_clickiers, count(customerId) as total_clicks,
        count(distinct if(position < 10, customerId,null) ) as clickers_pos_top10,
        count(distinct if(position between 10 and 19, customerId,null)) as clickers_pos_1020,  
        count(distinct if(position between 20 and 29, customerId,null)) as clickers_pos_2030,
        count(distinct if(position between 30 and 49, customerId,null)) as clickers_pos_3050,
        count(distinct if(position >= 50, customerId,null)) as clickers_pos_above50Cts        
 from clickstream.click_20141013 where type='event' 
 and (refererUrl is null or refererUrl='http://www.zulily.com/' or instr(refererUrl,'.zulily.com/new-today/')>0 or instr(refererUrl,'.zulily.com/newToday')>0 
      or instr(refererUrl,'.zulily.com/endsSoon')>0 or instr(refererUrl,'.zulily.com/toys')>0 or instr(refererUrl,'.zulily.com/ready')>0
      or instr(refererUrl,'.zulily.com/lastDay/')>0 or instr(refererUrl,'.zulily.com/shopByAge/')>0 or instr(refererUrl,'.zulily.com/shopByCategory/')>0 
      or instr(refererUrl,'.zulily.com/girls')>0 or instr(refererUrl,'.zulily.com/boys')>0 or instr(refererUrl,'.zulily.com/women')>0 or instr(refererUrl,'.zulily.com/men')>0 
      or instr(refererUrl,'.zulily.com/home')>0 or instr(refererUrl,'.zulily.com/shoes')>0 or instr(refererUrl,'.zulily.com/health')>0 or instr(refererUrl,'.zulily.com/baby')>0)
 group each by eventId
) ck on i.eventId=ck.eventId
left join  
(select integer(eventId) as eventId, count(distinct customerId) as cartadders, sum(qty) as qty
 from clickstream.cartadd_20141013 group each by eventId
) ca on i.eventId=ca.eventId
left join  
(select integer(eventId) as eventId,
        count(distinct customerId) as purchasers, 
        sum(qty) as total_units,
        sum(price) as total_demand
        from clickstream.purchase_20141013
        group each by eventId
) p on i.eventId = p.eventId
where date(e.event_start_date)<='2014-10-13'
Do you have a job id for a job that failed with connection error? - Jordan Tigani

1 Answer


A couple of things:

  1. If you are hitting an urgent issue, it is usually best to contact support. StackOverflow tends to be populated by engineers who check in somewhat infrequently and other community members who may not be able to help with an urgent request.
  2. If you provide a job id (with project id), it makes it a lot easier to look up what is happening with your queries.
  3. If your non-shuffled join queries (i.e. plain JOIN rather than JOIN EACH) are taking a long time, there may be some additional things you can do to tune them.
    1. Try using JOIN EACH. While I realize there is currently an intermittent issue with JOIN EACH, if you're seeing queries taking on the order of hours, you're probably hitting resource limits that JOIN EACH can help with.
    2. Verify that you don't have a JOIN explosion. That is, make sure that each key on the left only matches one key on the right. Sometimes people think this is the case, but there can be an empty or dummy value that can cause multiple matches.
    3. Try pushing the where clause inside a subselect so that filters happen before the join, if possible.
    4. Try making sure that the right hand tables (i.e. all tables that aren't in the original FROM clause) are the smaller tables.
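As a rough illustration of points 1 and 3, here is how one of the subselects from the question could be rewritten with a shuffled join and the date filter kept inside the subselect (a sketch only — table and column names are copied from the question's query, and only two output columns are shown to keep it short):

    -- Sketch: JOIN EACH plus the date filter pushed into the subselect.
    -- With the filter inside the subselect, the trailing
    -- "where date(e.event_start_date) <= '2014-10-13'" on the outer
    -- query becomes redundant and can be dropped.
    select i.eventId, e.event_name
    from
      (select integer(eventId) as eventId,
              count(distinct customerId) as impressionCts
       from clickstream.event_impression_20141013
       group each by eventId) i
    join each
      (select magenta_event_id, event_name
       from zudw.event
       where date(event_start_date) <= '2014-10-13'
         and date(event_end_date) >= '2014-10-13') e
    on e.magenta_event_id = i.eventId

Note that JOIN EACH shuffles both sides of the join, so it trades some per-query overhead for the ability to handle right-hand tables that don't fit in memory.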

With more information about your job, I can look in the logs to see why you got a connection error and why your previous queries took so long.
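To check for a join explosion (point 2 above), a quick diagnostic is to count how many rows each join key matches on the right-hand side. For example, against the event table from the question (a sketch — the filter mirrors the subselect in the question's query):

    -- Any key returned here matches more than one row in the join,
    -- which multiplies the rows flowing out of the JOIN.
    select magenta_event_id, count(*) as n
    from zudw.event
    where date(event_start_date) <= '2014-10-13'
      and date(event_end_date) >= '2014-10-13'
    group by magenta_event_id
    having n > 1

If this returns any rows, deduplicate the subselect (or tighten its filter) before joining.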