Wrong results with group by for distinct count

Question

I have these two queries for calculating a distinct count from a table for a particular date range. In my first query I group by location, aRID ( which is a rule ) and date. In my second query I don't group by a date.

I am expecting the same distinct count in both the results but I get total count as 6147 in first result and 6359 in second result. What is wrong here? The difference is group by..

select
  r.loc 
 ,cast(r.date as DATE) as dateCol
 ,count(distinct r.dC) as dC_count
from table r
where r.date between '01-01-2018' and '06-02-2018'
and r.loc = 1
group by r.loc, r.aRId, cast(r.date as DATE)

select
r.loc 
,count(distinct r.DC) as dC_count
from table r
and r.date between '01-01-2018' and '06-02-2018'
and r.loc = 1
group by r.loc, r.aRId

loc dateCol     dC_count
1   2018-01-22  1
1   2018-03-09  2
1   2018-01-28  3
1   2018-01-05  1
1   2018-05-28  143
1   2018-02-17  1
1   2018-05-08  187
1   2018-05-31  146
1   2018-01-02  3
1   2018-02-14  1
1   2018-05-11  273
1   2018-01-14  1
1   2018-03-18  2
1   2018-02-03  1
1   2018-05-20  200
1   2018-05-14  230
1   2018-01-11  5
1   2018-01-31  1
1   2018-05-17  209
1   2018-01-20  2
1   2018-03-01  1
1   2018-01-03  3
1   2018-05-06  253
1   2018-05-26  187
1   2018-03-24  1
1   2018-02-09  1
1   2018-03-04  1
1   2018-05-03  269
1   2018-05-23  187
1   2018-05-29  133
1   2018-03-21  1
1   2018-03-27  1
1   2018-05-15  202
1   2018-03-07  1
1   2018-06-01  155
1   2018-02-21  1
1   2018-01-26  2
1   2018-02-15  2
1   2018-05-12  331
1   2018-03-10  1
1   2018-01-09  3
1   2018-02-18  1
1   2018-03-13  2
1   2018-05-09  184
1   2018-01-12  2
1   2018-03-16  1
1   2018-05-18  198
1   2018-02-07  1
1   2018-02-01  1
1   2018-01-15  3
1   2018-02-24  4
1   2018-03-19  1
1   2018-05-21  161
1   2018-02-10  1
1   2018-05-04  250
1   2018-05-30  148
1   2018-05-24  153
1   2018-01-24  1
1   2018-05-10  199
1   2018-03-08  1
1   2018-01-21  1
1   2018-05-27  151
1   2018-01-04  3
1   2018-05-07  236
1   2018-03-25  1
1   2018-03-11  2
1   2018-01-10  1
1   2018-01-30  1
1   2018-03-14  1
1   2018-02-19  1
1   2018-05-16  192
1   2018-01-13  5
1   2018-01-07  1
1   2018-03-17  3
1   2018-01-27  2
1   2018-02-22  1
1   2018-05-13  200
1   2018-02-08  2
1   2018-01-16  2
1   2018-03-03  1
1   2018-05-02  217
1   2018-05-22  163
1   2018-03-20  1
1   2018-02-05  2
1   2018-02-11  1
1   2018-01-19  2
1   2018-02-28  1
1   2018-05-05  332
1   2018-05-25  211
1   2018-03-23  1
1   2018-05-19  219

loc dC_count
1   6359

So, do some locations have multiple records for one or more dates? — Ben
It is 6417 in the first case (not 6147). It sounds valid to me as there are more chances to find distinct values on a "shorter" intervals. — Serge Makarov

sticky bit sticky bit · Accepted Answer · 2018-06-21T21:48:36

From "COUNT (Transact-SQL)"

COUNT(DISTINCT expression) evaluates expression for each row in a group, and returns the number of unique, nonnull values.

The distinct is relative to the group, not to the whole table (or selected subset). I think this might be your misconception here.

To better understand what this means, take the following simplified example:

CREATE TABLE group_test
             (a varchar(1),
              b varchar(1),
              c varchar(1));

INSERT INTO group_test
            (a,
             b,
             c)
            VALUES ('a',
                    'r',
                    'x'),
                   ('a',
                    's',
                    'x'),
                   ('b',
                    'r',
                    'x'),
                   ('b',
                    's',
                    'y');

If we GROUP BY a and select count(DISTINCT c)

SELECT a,
       count(DISTINCT c) #
       FROM group_test
       GROUP BY a;

we get

 a  | #
----|----
 a  |  1
 b  |  2

As there is only c='x' for a=1, there is only a distinct count of 1 for this group but 2 for the other group as it has 'x'and 'y' in c. The sum of counts is 3 here.

Now if we GROUP BY a, b

SELECT a,
       b,
       count(DISTINCT c) #
       FROM group_test
       GROUP BY a,
                b;

we get

 a  | b  | #
----|----|----
 a  | r  |  1
 a  | s  |  1
 b  | r  |  1
 b  | s  |  1

We get 1 for every count here as each value of c is the only one in the group. And all of a sudden the sum of counts is 4.

And if we get the distinct count of c for the whole table

SELECT count(DISTINCT c) #
       FROM group_test;

we get

 #  
----
  2

which sums up to 2.

The sum of the counts is different in each case but right none the less.

The more groups there are, the higher the chance for a value to be unique within that group. So your results seem totally plausible.

db<>fiddle

Wrong results with group by for distinct count

1 Answers