I have a regular and very expensive query that I need to run, and unfortunately I have to JOIN the results of that query with almost the exact same query in order to get a ratio... resulting in using a query taking over 3minutes to run. This is what I would like to do.... (assuming that avoiding a JOIN would speed up the query time)
SELECT
date,
meal,
country,
COUNT(DISTINCT person, WHERE UPPER(ingredient) CONTAINS "SUN BUTTER", 10000000) as total_sunbutter_meals_per_day
COUNT(DISTINCT person, 10000000) as total_meals
ROUND(100*total_sunbutter_meals_per_day/total_meals,1) as percentage_meals_sunbutter
FROM [project:dataset.menu]
GROUP BY date, meals, country
This is what I am forced to do....
SELECT
total.date as date,
total.meal as meal,
total.country as country,
total_sunbutter_meals_per_day,
total_meals_per_day,
ROUND(100*total_sunbutter_meals_per_day/total_meals,1) as percentage_meals_sunbutter
FROM
(
SELECT
date,
meal,
country,
COUNT(DISTINCT person, 100000) as total_sunbutter_meals_per_day
FROM [project:dataset.menu]
WHERE
UPPER(ingredient) CONTAINS "SUN BUTTER"
GROUP BY date, meals, country
) as sunbutter
JOIN
(
SELECT
date,
meal,
country,
COUNT(DISTINCT person, 100000) as total_meals_per_day
FROM [project:dataset.menu]
GROUP BY date, meals, country
) as total
ON total.date = sunbutter.date AND total.meal = sunbutter.meal AND total.country = sunbutter.country
Three issues/questions:
- It seems like there should be a way Big Query could perform a COUNT(DISTINCT field) with some embedded conditional logic. Is there a way to avoid doing a join in this scenario above?
- COUNT DISTINCT with a value greater than 100,000 has failed for me. I'd like to be able to do a COUNT DISTINCT of 10,000,000. Are there known performance issue with COUNT DISTINCT and large values? Is this being addressed?
Are there plans to be able to use the declared/calculated field name within the SELECT in another statement within that SELECT? in the top example, I'd like to use the names for the results instead of repeating the formula in the ROUND statement. (ie I'd like to specify
total_sunbutter_meals_per_day/total_meals Instead of
COUNT(DISTINCT person, WHERE UPPER(ingredient) CONTAINS "SUN BUTTER", 100000)/COUNT(DISTINCT person, 10000000)
Thanks in advance for the help!