How to effectively group by two fields in BigQuery?

Question

I have a 160GB database in BigQuery that holds events sent from apps over the last month. Each row has, among others, a date field, a userId field and an event field.

What I need to do is find how many unique users used the app on a given day. For example, if user X used the app three times on 2017/09/15, count him only once for that day, but if he used the app on 2017/09/15, 2017/09/16 and 2017/09/17, count him once for each of those days.

What I have tried is to group the rows by userId and by date, and then separately count the number of userIds for each date, for example:

SELECT
  userId,
  DATE(date) AS date_conv
FROM
  [TABLE]
WHERE
  userId IS NOT NULL
GROUP BY
  date_conv,
  userId
ORDER BY
  userId

However, this query returns "Resources exceeded", even with an increased billing tier, so it would be great if there were a more efficient way.

Gordon Linoff · Accepted Answer · 2017-10-04T12:04:43

If you want to count the number of unique users, then use count(distinct):

SELECT DATE(date) AS date_conv, COUNT(DISTINCT userId) as numusers
FROM [TABLE]
GROUP BY DATE(date)
ORDER BY DATE(date);

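You are probably getting the "Resources exceeded" error because of the ORDER BY: your query returns one row per user per day, and the final sort of such a large result has to happen on a single node.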

EDIT:

I haven't tried COUNT(DISTINCT) on a really large volume of data on BigQuery. In some databases, nested aggregation optimizes much better:

SELECT date_conv, COUNT(*) AS numusers
FROM (SELECT DATE(date) AS date_conv, userId
      FROM [TABLE]
      WHERE userId IS NOT NULL
      GROUP BY date_conv, userId
     ) du
GROUP BY date_conv
ORDER BY date_conv;
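
To sanity-check that the two approaches agree, here is a minimal sketch that runs both on a toy table. It uses Python's built-in sqlite3 module purely as a local stand-in for BigQuery, so it says nothing about BigQuery performance. The events table and its rows are made up for illustration. The nested version has its outer query group by the date_conv alias that the subquery exposes, and it filters out NULL userIds, since COUNT(*) would otherwise count the NULL group as a user while COUNT(DISTINCT) ignores it.

```python
import sqlite3

# Hypothetical toy data: SQLite stands in for BigQuery only so the
# sketch can be run locally. Table name and rows are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (date TEXT, userId TEXT, event TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("2017-09-15 08:00:00", "X", "open"),
        ("2017-09-15 12:30:00", "X", "click"),
        ("2017-09-15 21:10:00", "X", "close"),
        ("2017-09-16 09:00:00", "X", "open"),
        ("2017-09-17 09:00:00", "X", "open"),
        ("2017-09-15 10:00:00", "Y", "open"),
        ("2017-09-16 11:00:00", None, "open"),  # row without a userId
    ],
)

# Approach 1: one aggregation with COUNT(DISTINCT), which skips NULLs.
count_distinct = conn.execute(
    """
    SELECT DATE(date) AS date_conv, COUNT(DISTINCT userId) AS numusers
    FROM events
    GROUP BY date_conv
    ORDER BY date_conv
    """
).fetchall()

# Approach 2: nested aggregation. The subquery reduces the data to one
# row per (day, user); the outer query counts those rows per day.
nested = conn.execute(
    """
    SELECT date_conv, COUNT(*) AS numusers
    FROM (SELECT DATE(date) AS date_conv, userId
          FROM events
          WHERE userId IS NOT NULL
          GROUP BY date_conv, userId) du
    GROUP BY date_conv
    ORDER BY date_conv
    """
).fetchall()

print(count_distinct)
print(nested)
```

On this data both queries count user X once per day, even though X has three events on 2017-09-15. One difference to keep in mind: a day whose rows all have a NULL userId would show up with a count of 0 in the first query and would be missing entirely from the second.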
