Why is PostgreSQL not using just the covering index in this query depending on the contents of its IN() clause?

Question

I have a table with a covering index that should respond to a query using just the index, without checking the table at all. Postgres does, in fact, do that, if the IN() clause has 1 or a few elements in it. However, if the IN clause has lots of elements, it seems like it's doing the search on the index, and then going to the table and re-checking the conditions...

I can't figure out why Postgres would do that. It can either serve the query straight from the index or it can't, why would it go to the table if it (in theory) doesn't have anything else to add?

The table:

CREATE TABLE phone_numbers
(
  id serial NOT NULL,
  phone_number character varying,
  hashed_phone_number character varying,
  user_id integer,
  created_at timestamp without time zone,
  updated_at timestamp without time zone,
  ghost boolean DEFAULT false,
  CONSTRAINT phone_numbers_pkey PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);

CREATE INDEX index_phone_numbers_covering_hashed_ghost_and_user
  ON phone_numbers
  USING btree
  (hashed_phone_number COLLATE pg_catalog."default", ghost, user_id);

The query I'm running is :

SELECT "phone_numbers"."user_id" 
FROM "phone_numbers" 
WHERE "phone_numbers"."hashed_phone_number" IN (*several numbers*) 
  AND "phone_numbers"."ghost" = 'f'

As you can see, the index has all the fields it needs to reply to that query.

And if I have only one or a few numbers in the IN clause, it does:

1 number:

Index Scan using index_phone_numbers_on_hashed_phone_number on phone_numbers (cost=0.41..8.43 rows=1 width=4)
Index Cond: ((hashed_phone_number)::text = 'bebd43a6eb29b2fda3bcb63dcc7ffaf5433e78660ccd1a495c1180a3eaaf6b6a'::text)
Filter: (NOT ghost)"

3 numbers:

Index Only Scan using index_phone_numbers_covering_hashed_ghost_and_user on phone_numbers (cost=0.42..17.29 rows=1 width=4)
Index Cond: ((hashed_phone_number = ANY ('{8228a8116f1fdb12e243102cb85ecd859ebf7873d9332dce5f1343a481ec72e8,43ddeebdca2ea829d468d5debc84d475c8322cf4bf6edca286c918b04216387e,1578bf773eb6eb8a9b57a130922a28c9c91f1bda67202ef5936b39630ca4cfe4}'::text[])) AND (...)
Filter: (NOT ghost)"

However, when I have a lot of numbers in the IN clause, Postgres is using the Index, but then hitting the table, and I don't know why:

Bitmap Heap Scan on phone_numbers (cost=926.59..1255.81 rows=106 width=4)
  Recheck Cond: ((hashed_phone_number)::text = ANY ('{b6459ce58f21d99c462b132cce7adc9ea947fa522a3849321e9fb65893006a5e,8228a8116f1fdb12e243102cb85ecd859ebf7873d9332dce5f1343a481ec72e8,ab3554acc1f287bb2e22ff20bb855e19a4177ef552676689d217dbb2a1a6177b,7ec9f58 (...)
  Filter: (NOT ghost)
  -> Bitmap Index Scan on index_phone_numbers_covering_hashed_ghost_and_user (cost=0.00..926.56 rows=106 width=0)
        Index Cond: (((hashed_phone_number)::text = ANY ('{b6459ce58f21d99c462b132cce7adc9ea947fa522a3849321e9fb65893006a5e,8228a8116f1fdb12e243102cb85ecd859ebf7873d9332dce5f1343a481ec72e8,ab3554acc1f287bb2e22ff20bb855e19a4177ef552676689d217dbb2a1a6177b,7e (...)

This is currently making this query, which is looking for 250 records in a table with 50k total rows, about twice as low as a similar query on another table, which looks for 250 records in a table with 5 million rows, which doesn't make much sense.

Any ideas what could be happening, and whether I can do anything to improve this?

UPDATE: Changing the order of the columns in the covering index to have first ghost and then hashed_phone_number also doesn't solve it:

Bitmap Heap Scan on phone_numbers (cost=926.59..1255.81 rows=106 width=4)
  Recheck Cond: ((hashed_phone_number)::text = ANY ('{b6459ce58f21d99c462b132cce7adc9ea947fa522a3849321e9fb65893006a5e,8228a8116f1fdb12e243102cb85ecd859ebf7873d9332dce5f1343a481ec72e8,ab3554acc1f287bb2e22ff20bb855e19a4177ef552676689d217dbb2a1a6177b,7ec9f58 (...)
  Filter: (NOT ghost)
  -> Bitmap Index Scan on index_phone_numbers_covering_ghost_hashed_and_user (cost=0.00..926.56 rows=106 width=0)
        Index Cond: ((ghost = false) AND ((hashed_phone_number)::text = ANY ('{b6459ce58f21d99c462b132cce7adc9ea947fa522a3849321e9fb65893006a5e,8228a8116f1fdb12e243102cb85ecd859ebf7873d9332dce5f1343a481ec72e8,ab3554acc1f287bb2e22ff20bb855e19a4177ef55267668 (...)

Gordon Linoff Gordon Linoff · Accepted Answer · 2015-05-26T11:51:23

The choice of indexes is based on what the optimizer says is the best solution for the query. Postgres is trying really hard with your index, but it is not the best index for the query.

The best index has ghost first:

CREATE INDEX index_phone_numbers_covering_hashed_ghost_and_user
  ON phone_numbers
  USING btree
  (ghost, hashed_phone_number COLLATE pg_catalog."default", user_id);

I happen to think that MySQL documentation does a good job of explaining how composite indexes are used.

Essentially, what is happening is that Postgres needs to do an index seek for every element of the in list. This may be compounded by the use of strings -- because collations/encodings affect the comparisons. Eventually, Postgres decides that other approaches are more efficient. If you put ghost first, then it will just jump to the right part of the index and find the rows it needs there.

Why is PostgreSQL not using *just* the covering index in this query depending on the contents of its IN() clause?

1 Answers

Why is PostgreSQL not using just the covering index in this query depending on the contents of its IN() clause?