get a distinct count and repeating per row in BigQuery SQL

Question

i havent been able to find an answer to this but I think it should be possible to do this with only one query with BigQuery - i'm looking for something close to this, even with approximate results.

let's say i have a table that looks like this, with one column called myValue

====
myValue
====
foo  |
bar  |
bar  |
baz  |

i would like to be able to have a single query that results in a column next to it that has the distinct counts of myValue, on every row.

=======|=====================|
myValue|myNewCounts          |
=======|=====================|
foo    |[foo:1, bar:2, baz:1]
bar    |[foo:1, bar:2, baz:1]
bar    |[foo:1, bar:2, baz:1]
baz    |[foo:1, bar:2, baz:1]

I know that you can use ARRAY_AGG(distinct) to get the distinct values on every row, but I haven't been able to find a way to also get the counts as well on every row, even in an approximate fashion.

It's important that this be done in a single query - I could obviously have a separate query that calculates the distinct counts and then join that back to this table, but im trying to do this in one query.

one would think - that in a columnar database - returning myValue and myNewCounts in one pass should be doable somehow....

Mikhail Berlyant Mikhail Berlyant · Accepted Answer · 2020-05-16T04:06:19

Below is for BigQuery Standard SQL

#standardSQL
SELECT * FROM `project.dataset.table`, (
SELECT '[' || STRING_AGG(x, ', ') || ']' myNewCounts FROM (
  SELECT FORMAT('%s:%i', myValue, COUNT(1)) x
  FROM `project.dataset.table`
  GROUP BY myValue
))

if to apply to sample data from your question as in below example

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'foo' myValue UNION ALL
  SELECT 'bar' UNION ALL
  SELECT 'bar' UNION ALL
  SELECT 'baz' 
)
SELECT * FROM `project.dataset.table`, (
SELECT '[' || STRING_AGG(x, ', ') || ']' myNewCounts FROM (
  SELECT FORMAT('%s:%i', myValue, COUNT(1)) x
  FROM `project.dataset.table`
  GROUP BY myValue
))

result is

Row myValue myNewCounts  
1   foo     [foo:1, bar:2, baz:1]    
2   bar     [foo:1, bar:2, baz:1]    
3   bar     [foo:1, bar:2, baz:1]    
4   baz     [foo:1, bar:2, baz:1]

In case if myNewCounts is expected to be an array - use below version instead

#standardSQL
SELECT * FROM `project.dataset.table`, (
SELECT ARRAY_AGG(x) myNewCounts FROM (
  SELECT FORMAT('%s:%i', myValue, COUNT(1)) x
  FROM `project.dataset.table`
  GROUP BY myValue
))

get a distinct count and repeating per row in BigQuery SQL

1 Answers