postgresql COUNT(DISTINCT …) very slow

186

votes

I have a very simple SQL query:

SELECT COUNT(DISTINCT x) FROM table;

My table has about 1.5 million rows. This query is running pretty slowly; it takes about 7.5s, compared to

 SELECT COUNT(x) FROM table;

which takes about 435ms. Is there any way to change my query to improve performance? I've tried grouping and doing a regular count, as well as putting an index on x; both have the same 7.5s execution time.

performancepostgresqlcountdistinct

I don't think so. Getting the distinct values of 1.5 million rows is just going to be slow. – Ry-♦

I just tried it in C#, getting the distinct values of 1.5 million integers from memory takes over one second on my computer. So I think you're probably out of luck. – Ry-♦

The query plan will very much depend on the table structure (indexes) and the setting of the tuning constants (work)mem, effective_cache_size, random_page_cost). With reasonable tuning the query could possibly be executed in less than a second. – wildplasser

Could you be more specific? What indexes and tuning constants would be required to get it under a second? For simplicity, assume this is a two-column table with a primary key on the first column y, and I'm doing this 'distinct' query on a second column x of type int, with 1.5 million rows. – ferson2020

Please, include the table definition with all the indexes (\d output of psql is good one) and precise the column that you have problem with. It'd be good to see EXPLAIN ANALYZE of both queries. – vyegorov

366

votes

You can use this:

SELECT COUNT(*) FROM (SELECT DISTINCT column_name FROM table_name) AS temp;

This is much faster than:

COUNT(DISTINCT column_name)

12

votes

-- My default settings (this is basically a single-session machine, so work_mem is pretty high)
SET effective_cache_size='2048MB';
SET work_mem='16MB';

\echo original
EXPLAIN ANALYZE
SELECT
        COUNT (distinct val) as aantal
FROM one
        ;

\echo group by+count(*)
EXPLAIN ANALYZE
SELECT
        distinct val
       -- , COUNT(*)
FROM one
GROUP BY val;

\echo with CTE
EXPLAIN ANALYZE
WITH agg AS (
    SELECT distinct val
    FROM one
    GROUP BY val
    )
SELECT COUNT (*) as aantal
FROM agg
        ;

Results:

original                                                      QUERY PLAN                                                      
----------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=36448.06..36448.07 rows=1 width=4) (actual time=1766.472..1766.472 rows=1 loops=1)
   ->  Seq Scan on one  (cost=0.00..32698.45 rows=1499845 width=4) (actual time=31.371..185.914 rows=1499845 loops=1)
 Total runtime: 1766.642 ms
(3 rows)

group by+count(*)
                                                         QUERY PLAN                                                         
----------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=36464.31..36477.31 rows=1300 width=4) (actual time=412.470..412.598 rows=1300 loops=1)
   ->  HashAggregate  (cost=36448.06..36461.06 rows=1300 width=4) (actual time=412.066..412.203 rows=1300 loops=1)
         ->  Seq Scan on one  (cost=0.00..32698.45 rows=1499845 width=4) (actual time=26.134..166.846 rows=1499845 loops=1)
 Total runtime: 412.686 ms
(4 rows)

with CTE
                                                             QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=36506.56..36506.57 rows=1 width=0) (actual time=408.239..408.239 rows=1 loops=1)
   CTE agg
     ->  HashAggregate  (cost=36464.31..36477.31 rows=1300 width=4) (actual time=407.704..407.847 rows=1300 loops=1)
           ->  HashAggregate  (cost=36448.06..36461.06 rows=1300 width=4) (actual time=407.320..407.467 rows=1300 loops=1)
                 ->  Seq Scan on one  (cost=0.00..32698.45 rows=1499845 width=4) (actual time=24.321..165.256 rows=1499845 loops=1)
       ->  CTE Scan on agg  (cost=0.00..26.00 rows=1300 width=0) (actual time=407.707..408.154 rows=1300 loops=1)
     Total runtime: 408.300 ms
    (7 rows)

The same plan as for the CTE could probably also be produced by other methods (window functions)

3

votes

If your count(distinct(x)) is significantly slower than count(x) then you can speed up this query by maintaining x value counts in different table, for example table_name_x_counts (x integer not null, x_count int not null), using triggers. But your write performance will suffer and if you update multiple x values in single transaction then you'd need to do this in some explicit order to avoid possible deadlock.

0

votes

I was also searching same answer, because at some point of time I needed total_count with distinct values along with limit/offset.

Because it's little tricky to do- To get total count with distinct values along with limit/offset. Usually it's hard to get total count with limit/offset. Finally I got the way to do -

SELECT DISTINCT COUNT(*) OVER() as total_count, * FROM table_name limit 2 offset 0;

Query performance is also high.

-3

votes

select coluna, count(coluna) as qtd from tabela group by coluna

postgresql COUNT(DISTINCT …) very slow

5 Answers