I am trying to query the collaborator logins, repository language and name from the github archive through Google BigQuery. The following query works fine if i exclude GROUP BY, but with GROUP BY the query goes on forever until I get a timeout from google bigquery. Since Google BigQuery doesn't have DISTINCT, I'm trying to use GROUP BY as DISTINCT so that i wouldn't get repeated lines. Here's the query i'm using :
SELECT
a1.actor_attributes_login,
a2.actor_attributes_login,
a1.repository_language,
a1.repository_name,
FROM
[githubarchive:year.2014] AS a1
LEFT JOIN
[githubarchive:year.2014] AS a2
ON
a1.repository_name = a2.repository_name
WHERE
a1.actor_attributes_login != a2.actor_attributes_login
AND a1.actor_attributes_location = "California"
AND (a1.repository_language = "Java"
OR a1.repository_language = "Python")
GROUP BY
a1.actor_attributes_login,
a2.actor_attributes_login,
a1.repository_language,
a1.repository_name
LIMIT
10000