13
votes

I'm trying to run a query joining 2 large sets of data and I'm hitting the resources exceeded during query execution error. I've read that there are work around when using Join Each and Group Each but not what those workaround would be.

SELECT 
  year(users.firstseen) as first_year,
  month(users.firstseen) as first_month, 
  DATEDIFF(orders.timestamp,users.firstseen) as days_elapsed,
  count(orders.user_key) as count_orders
FROM 
  [project.orders] as orders
JOIN EACH
  [project.users] AS users
ON
  orders.user_key = users.user_key
WHERE orders.store = 'ios'
GROUP EACH BY 1,2,3

Edit: the following worked:

SELECT
  year(users.firstseen) as firstyear,
  month(users.firstseen) as firstmonth,
  DATEDIFF(orders.timestamp, users.firstseen) as days_elapsed,
  COUNT(users.firstseen) AS count_orders FROM [project.orders] as orders
JOIN EACH( SELECT user_key, firstseen FROM [project.users]
WHERE store_key = 'ios') as users ON orders.user_key = users.user_key
GROUP BY firstyear, firstmonth, days_elapsed
ORDER BY firstyear, firstmonth, days_elapsed
1
this ended up working SELECT year(users.firstseen) as firstyear, month(users.firstseen) as firstmonth, DATEDIFF(orders.timestamp, users.firstseen) as days_elapsed, COUNT(users.firstseen) AS count_orders FROM [project.orders] as orders JOIN EACH( SELECT user_key, firstseen FROM [project.users] WHERE store_key = 'ios') as users ON orders.user_key = users.user_key GROUP BY firstyear, firstmonth, days_elapsed ORDER BY firstyear, firstmonth, days_elapseduser2388120

1 Answers

13
votes

JOIN EACH can fail if your join keys (in this case, user_key) are unevenly distributed. For example, if you have one user_key that appears abnormally often, you'll get a "resources exceeded" error from the node that handles that key. Alternatively, you could try running the query over a smaller set of user keys by filtering out some portion of the user keys before the join.

GROUP EACH BY can fail if you have too many distinct group keys. You could try whittling down the join output by adding a few more WHERE clauses in order to see if this is the case.

Basically, I'd recommend whittling down the inputs to either the JOIN EACH or the GROUP EACH BY until you get the query to work, and then you'll have a better sense for the limits you're running up against. Once you know that, you can (hopefully) structure your queries to get the most out of the available resources.

(BTW, we expect to tune these operations in the near future to remove some of the limits you may be hitting!)