1
votes

I'm trying to run a simple query which joins too large dataset but I'm running into various errors. Reproduced here is a similar query using a public database

SELECT gn1.actor_attributes.blog, gn1.actor_attributes.company, gn1.actor_attributes.email, gn1.actor_attributes.gravatar_id, gn1.actor_attributes.location, gn1.actor_attributes.login, gn1.actor_attributes.name,gn2.actor_attributes.blog, gn2.actor_attributes.company, gn2.actor_attributes.email, gn2.actor_attributes.gravatar_id, gn2.actor_attributes.location, gn2.actor_attributes.login, gn2.actor_attributes.name
FROM [publicdata:samples.github_nested] as gn1 inner join (select actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name from [publicdata:samples.github_nested] group by actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name) as gn2 on gn1.payload.target.login=gn2.actor_attributes.login
WHERE gn1.type='FollowEvent'

Without "inner join each" it says that the database size is too large. When I run the query with "inner join each" big query gives an error stating:

Cannot execute partitioned join because gn2 is not parallelizable: (SELECT [actor_attributes.blog], [actor_attributes.company], [actor_attributes.email], [actor_attributes.gravatar_id], [actor_attributes.location], [actor_attributes.login], [actor_attributes.name] FROM [publicdata:samples.github_nested] GROUP BY [actor_attributes.blog], [actor_attributes.company], [actor_attributes.email], [actor_attributes.gravatar_id], [actor_attributes.location], [actor_attributes.login], [actor_attributes.name])

Any help will be greatly appreciated

Thanks

1

1 Answers

1
votes

Thanks for having a [non] working public example. Makes debugging much easier.

Reformatting the original query:

SELECT gn1.actor_attributes.blog, gn1.actor_attributes.company, gn1.actor_attributes.email, gn1.actor_attributes.gravatar_id, gn1.actor_attributes.location, gn1.actor_attributes.login, gn1.actor_attributes.name,gn2.actor_attributes.blog, gn2.actor_attributes.company, gn2.actor_attributes.email, gn2.actor_attributes.gravatar_id, gn2.actor_attributes.location, gn2.actor_attributes.login, gn2.actor_attributes.name
FROM [publicdata:samples.github_nested] AS gn1 
INNER JOIN EACH (
  SELECT actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name
  FROM [publicdata:samples.github_nested]
  GROUP BY actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name) 
AS gn2 ON gn1.payload.target.login=gn2.actor_attributes.login
WHERE gn1.type='FollowEvent'

That query indeed fails with said error message. While the error message could be better, the solution is simple: Just add EACH to the GROUP BY on the sub-query, to make it parallelizable:

SELECT gn1.actor_attributes.blog, gn1.actor_attributes.company, gn1.actor_attributes.email, gn1.actor_attributes.gravatar_id, gn1.actor_attributes.location, gn1.actor_attributes.login, gn1.actor_attributes.name,gn2.actor_attributes.blog, gn2.actor_attributes.company, gn2.actor_attributes.email, gn2.actor_attributes.gravatar_id, gn2.actor_attributes.location, gn2.actor_attributes.login, gn2.actor_attributes.name
FROM [publicdata:samples.github_nested] AS gn1 
INNER JOIN EACH (
  SELECT actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name
  FROM [publicdata:samples.github_nested]
  GROUP EACH BY actor_attributes.blog, actor_attributes.company,actor_attributes.email, actor_attributes.gravatar_id, actor_attributes.location, actor_attributes.login, actor_attributes.name) 
AS gn2 ON gn1.payload.target.login=gn2.actor_attributes.login
WHERE gn1.type='FollowEvent'

[Query complete (12.7s elapsed, 237 MB processed)]