1
votes

I am trying to query the collaborator logins, repository language and name from the github archive through Google BigQuery. The following query works fine if i exclude GROUP BY, but with GROUP BY the query goes on forever until I get a timeout from google bigquery. Since Google BigQuery doesn't have DISTINCT, I'm trying to use GROUP BY as DISTINCT so that i wouldn't get repeated lines. Here's the query i'm using :

SELECT
    a1.actor_attributes_login,
    a2.actor_attributes_login,
    a1.repository_language,
    a1.repository_name,
FROM
    [githubarchive:year.2014] AS a1
LEFT JOIN
    [githubarchive:year.2014] AS a2
ON
    a1.repository_name = a2.repository_name
WHERE
    a1.actor_attributes_login != a2.actor_attributes_login
    AND a1.actor_attributes_location = "California"
    AND (a1.repository_language = "Java"
      OR a1.repository_language = "Python")
GROUP BY
    a1.actor_attributes_login,
    a2.actor_attributes_login,
    a1.repository_language,
    a1.repository_name
LIMIT
    10000
1

1 Answers

1
votes

Hmmm. You might try removing duplicates before doing the join:

SELECT a1.actor_attributes_login, a2.actor_attributes_login,
       a1.repository_language, a1.repository_name
FROM (SELECT a.actor_attributes_login, a.repository_language, a1.repository_name
      FROM githubarchive:year.2014] a
      WHERE a.actor_attributes_location = 'California AND
            a.repository_language IN ('Java', 'Python')
      GROUP BY a.actor_attributes_login, a.repository_language, a.repository_name
     ) a1 LEFT JOIN
     (SELECT a1.actor_attributes_login, a1.repository_language, a1.repository_name
      FROM githubarchive:year.2014] a1
      GROUP BY a1.actor_attributes_login, a1.repository_language, a1.repository_name
     ) a2
     ON a1.repository_name = a2.repository_name
WHERE a1.actor_attributes_login <> a2.actor_attributes_login
LIMIT 10000;

I don't think you need the outer GROUP BY if you eliminate the duplicates in the subqueries.

Also, you should have an ORDER BY if you are using LIMIT.