GROUP BY problems in Google Big Query

Question

I'm new to Google Big Query (and Stack Overflow), mainly testing the speed at which Big Query manages to process both well-crafted and poorly-crafted queries.

I'm having difficulty with a cumbersome query that runs (slowly) on MySQL. Big Query complains about the GROUP BY contents. Here's the starting query:

SELECT nonstops.term, nonstops.lincat, nonstops.id, 
MIN(
(1-((LEAST(1,minusone.catimp / nonstops.catimp) + LEAST(1,minusone.catweb / nonstops.catweb))/2))*
(1-((LEAST(1,minusone.catimp / nonstops.catimp) + LEAST(1,minusone.catweb / nonstops.catweb))/2))*
(1-((LEAST(1,minusone.catimp / nonstops.catimp) + LEAST(1,minusone.catweb / nonstops.catweb))/2))*
(nonstops.catweb * nonstops.catweb * nonstops.catimp / nonstops.fnvweb / nonstops.fnvimp)
)
AS calc FROM nonstops INNER JOIN EACH minusone ON nonstops.lincat = minusone.lincat AND nonstops.term = minusone.term 
WHERE nonstops.lincat = 556 GROUP BY nonstops.term, nonstops.lincat
ORDER BY `calc`  DESC

Note that "EACH" is added to the INNER JOIN as both tables are big. I have removed the dataset name to make it easier to read.

The GROUP BY is intended to return just the lowest value of the calculation that is made by the join for each term/lincat pair.

The error I get is:

(L1:62): Expression 'phrases.nonstops.id' is not present in the GROUP BY list

Which I don't want in the GROUP BY, but I've added it and I then get:

Expression 'calc' is not present in the GROUP BY list

Which I also do not want! But if I add it I get:

(L7:1): Cannot group by an aggregate

I have looked at the documentation and searched for an answer, but no luck. Any hints or links would be most appreciated.

Michael Entin Michael Entin · Accepted Answer · 2015-06-07T06:41:54

When you group by something in BigQuery (or any SQL) the resulting fields should be either group-by fields or aggregations.

Otherwise, which value of nonstops.id should you get? There might be many associated with a pair of (nonstops.term, nonstops.lincat).

You could select max, min, etc of id field instead; group by this field (but then you get calculation for each tuple of term, lingcat and id); or just remove this field if you don't need it - if you want the result of 'calc' expression for each (nonstops.term, nonstops.lincat) pair.

The reason BigQuery complains about `calc` is different - BigQuery (when you use legacy SQL dialect) does not use backticks (`) for quotation. So it thinks they are part of the name and it is a new field separate from 'calc' field. Just remove them or use BigQuery's quotations [ and ] - [calc]. Or switch to standard SQL dialect.

GROUP BY problems in Google Big Query

1 Answers