Logical checks and performance issues within a COUNT(DISTINCT foo)

Question

I have a regular and very expensive query that I need to run, and unfortunately I have to JOIN the results of that query with almost the exact same query in order to get a ratio... resulting in using a query taking over 3minutes to run. This is what I would like to do.... (assuming that avoiding a JOIN would speed up the query time)

SELECT
    date,
    meal,
    country,
    COUNT(DISTINCT person, WHERE UPPER(ingredient) CONTAINS "SUN BUTTER", 10000000) as total_sunbutter_meals_per_day
    COUNT(DISTINCT person, 10000000) as total_meals
    ROUND(100*total_sunbutter_meals_per_day/total_meals,1) as percentage_meals_sunbutter
FROM [project:dataset.menu]
GROUP BY date, meals, country

This is what I am forced to do....

SELECT
    total.date as date,
    total.meal as meal,
    total.country as country,
    total_sunbutter_meals_per_day,
    total_meals_per_day,
    ROUND(100*total_sunbutter_meals_per_day/total_meals,1) as percentage_meals_sunbutter
FROM
    (    
    SELECT
        date,
        meal,
        country,
        COUNT(DISTINCT person, 100000) as total_sunbutter_meals_per_day
    FROM [project:dataset.menu]
    WHERE    
        UPPER(ingredient) CONTAINS "SUN BUTTER"
    GROUP BY date, meals, country 
    ) as sunbutter
JOIN
    (
    SELECT
        date,
        meal,
        country,
        COUNT(DISTINCT person, 100000) as total_meals_per_day
    FROM [project:dataset.menu]
    GROUP BY date, meals, country 
    ) as total
ON total.date = sunbutter.date AND total.meal = sunbutter.meal AND total.country = sunbutter.country

Three issues/questions:

It seems like there should be a way Big Query could perform a COUNT(DISTINCT field) with some embedded conditional logic. Is there a way to avoid doing a join in this scenario above?
COUNT DISTINCT with a value greater than 100,000 has failed for me. I'd like to be able to do a COUNT DISTINCT of 10,000,000. Are there known performance issue with COUNT DISTINCT and large values? Is this being addressed?
Are there plans to be able to use the declared/calculated field name within the SELECT in another statement within that SELECT? in the top example, I'd like to use the names for the results instead of repeating the formula in the ROUND statement. (ie I'd like to specify

total_sunbutter_meals_per_day/total_meals Instead of

COUNT(DISTINCT person, WHERE UPPER(ingredient) CONTAINS "SUN BUTTER", 100000)/COUNT(DISTINCT person, 10000000)

Thanks in advance for the help!

Jeremy Condit Jeremy Condit · Accepted Answer · 2014-02-27T22:44:44

Question 1:

You could create an inner query with two distinct fields as follows:

SELECT
  date,
  meal,
  country,
  COUNT(DISTINCT person) total_meals,
  COUNT(DISTINCT sunbutter_person) total_sunbutter_meals,
FROM
  (SELECT
     date,
     meal,
     country,
     person,
     IF(UPPER(ingredient) CONTAINS "SUN BUTTER", person, NULL) sunbutter_person
   FROM [project:dataset.menu])

Question 2:

In BigQuery, COUNT(DISTINCT) returns an approximate result. If you increase the threshold at which precise results are returned, you're going to hurt performance (and eventually cause your query to fail) because a single worker needs to keep track of all of those distinct values. See BigQuery COUNT(DISTINCT value) vs COUNT(value) for more info.

If your need for precise results outgrows COUNT(DISTINCT)'s scalability, the alternative is to use GROUP EACH BY with COUNT(*), which will give you precise counts for distinct elements in a scalable way.

Note that you'll need to address the problem in Question 1 in a slightly different way. Something like:

SELECT
  date,
  meal,
  country,
  COUNT(*) total_meals,
  SUM(sunbutter) total_sunbutter_meals,
FROM
  (SELECT
     date,
     meal,
     country,
     IF(UPPER(ingredient) CONTAINS "SUN BUTTER", 1, 0) sunbutter,
   FROM [project:dataset.menu]
   GROUP EACH BY date, meal, country, person)
GROUP BY date, meal, country

Question 3:

Right now, you can't refer to other fields in the same SELECT statement, and we don't yet have plans to add that feature. But you can always wrap your query in another query.

Instead of:

SELECT 17 AS a, a + 1 AS b

You can write:

SELECT a, a + 1 AS b FROM (SELECT 17 AS a)

Logical checks and performance issues within a COUNT(DISTINCT foo)

1 Answers