Indexing jsonb column in Postgresql and IN clause

Question

I'm stuck with jsonb indexes and need help. I have a table with jsonb:

+-------+----------+------------------------------------------------------------+-------+
|id     |measure_id|parameters                                                  |value  |
+-------+----------+------------------------------------------------------------+-------+
|564174 |19        |{"1": 12, "2": 59, "5": 79, "6": 249, "7": 248, "8": 412}   |42.461 |
|564176 |19        |{"1": 12, "2": 59, "5": 80, "6": 249, "7": 248, "8": 412}   |46.198 |
|568244 |19        |{"1": 12, "2": 316, "5": 129, "6": 249, "7": 248, "8": 412} |19.482 |
|568246 |19        |{"1": 12, "2": 316, "5": 130, "6": 249, "7": 248, "8": 412} |20.051 |
|572313 |19        |{"1": 12, "2": 331, "5": 113, "6": 249, "7": 248, "8": 412} |7.098  |
|596434 |19        |{"1": 193, "2": 297, "5": 124, "6": 249, "7": 248, "8": 412}|103.253|
|682354 |22        |{"1": 427, "2": 25, "5": 121, "6": 426, "9": 441, "11": 428}|0.132  |
|686423 |22        |{"1": 427, "2": 60, "5": 72, "6": 426, "9": 443, "11": 428} |0.000  |
|1682439|44        |{"1": 193, "2": 518, "5": 91, "6": 426, "9": 429, "11": 431}|8.321  |
|1686787|44        |{"1": 193, "2": 515, "5": 96, "6": 426, "9": 429, "11": 431}|23.062 |
+-------+----------+------------------------------------------------------------+-------+

It's some statistical data and every row has measure and some parameters set. The number of parameters is different for every measure, so I put them in jsonb column. What I have to do:

select all distinct measures and parameters:

 SELECT DISTINCT 
     measure_id, 
     jsonb_object_keys(parameters) AS parameter_id, 
     parameters -> jsonb_object_keys(parameters) AS parameter_value_id 
 FROM data;

select data from this table:

 SELECT d.id,
    d.measure_id,
    CAST(d.attributes as TEXT) as attributes,
    CAST(d.parameters as TEXT) as parameters,
    d.value
 FROM data d
 WHERE d.measure_id=19
   AND (jsonb_extract_path(d.parameters, '1')::bigint in (12))
   AND (jsonb_extract_path(d.parameters, '2')::bigint in (2,59))
   AND (jsonb_extract_path(d.parameters, '5')::bigint in (79, 80, 129, 130, 113))
   AND (jsonb_extract_path(d.parameters, '6')::bigint in (249))
   AND (jsonb_extract_path(d.parameters, '7')::bigint in (248))
   AND (jsonb_extract_path(d.parameters, '8')::bigint in (412))
 ORDER BY d.id;

Both queries are running slow. My indexes:

CREATE INDEX idx_data_measure ON data USING btree (measure_id);

CREATE INDEX idx_data_parameters
ON data USING btree (((parameters ->> '1'::text)::bigint), ((parameters ->> '2'::text)::bigint),
         ((parameters ->> '5'::text)::bigint), ((parameters ->> '6'::text)::bigint),
         ((parameters ->> '7'::text)::bigint), ((parameters ->> '8'::text)::bigint),
         ((parameters ->> '9'::text)::bigint), ((parameters ->> '10'::text)::bigint),
         ((parameters ->> '11'::text)::bigint), ((parameters ->> '458'::text)::bigint),
         ((parameters ->> '717'::text)::bigint), ((parameters ->> '718'::text)::bigint),
         ((parameters ->> '719'::text)::bigint), ((parameters ->> '720'::text)::bigint));

I've tried to create one combined index:

CREATE INDEX idx_data_parameters ON data USING btree (measure_id, ((parameters ->> '1'::text)::bigint),...

but this doesn't help.

I've tried EXPLAIN ANALYZE, but honestly I don't understand it :(

EXPLAIN ANALYZE
SELECT DISTINCT
        measure_id,
        jsonb_object_keys(parameters) AS parameter_id,
        parameters -> jsonb_object_keys(parameters) AS parameter_value_id
FROM data;

QUERY PLAN
Unique  (cost=2212571.28..2222400.17 rows=982889 width=72) (actual time=79346.142..84316.123 rows=5050 loops=1)
  ->  Sort  (cost=2212571.28..2215028.50 rows=982889 width=72) (actual time=79346.141..82358.141 rows=5586011 loops=1)
        Sort Key: measure_id, (jsonb_object_keys(parameters)), ((parameters -> (jsonb_object_keys(parameters))))"
        Sort Method: external merge  Disk: 202816kB
        ->  Gather  (cost=1000.00..2034108.05 rows=982889 width=72) (actual time=2467.949..63448.545 rows=5586011 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              ->  Result  (cost=0.00..1934819.15 rows=40953700 width=72) (actual time=2432.167..63305.298 rows=1862004 loops=3)
                    ->  ProjectSet  (cost=0.00..1218129.40 rows=40953700 width=156) (actual time=2432.151..62251.992 rows=1862004 loops=3)
                          ->  Parallel Seq Scan on data  (cost=0.00..1010289.37 rows=409537 width=124) (actual time=2432.118..61448.821 rows=327630 loops=3)
Planning Time: 0.417 ms
Execution Time: 84406.575 ms

I feel that I have wrong indexes, but can't create it properly. As I understand GIN is not good idea as I need IN clause for parameters, so I made BTREE. Please help me with it.

EDIT 1: PG Version: PostgreSQL 11.8. Also updated query to fit sample data.

EDIT 2: Query plan for select data SELECT...WHERE...:

Sort  (cost=1030.03..1030.04 rows=1 width=83) (actual time=63.659..63.661 rows=5 loops=1)
  Sort Key: id
  Sort Method: quicksort  Memory: 26kB
  Buffers: shared hit=4881
  ->  Index Scan using idx_data_measure on data d  (cost=0.55..1030.02 rows=1 width=83) (actual time=0.044..63.635 rows=5 loops=1)
        Index Cond: (measure_id = 19)
        Filter: (((jsonb_extract_path(parameters, VARIADIC '{2}'::text[]))::bigint = ANY ('{2,59}'::bigint[])) AND ((jsonb_extract_path(parameters, VARIADIC '{1}'::text[]))::bigint = 12) AND ((jsonb_extract_path(parameters, VARIADIC '{6}'::text[]))::bigint = 249) AND ((jsonb_extract_path(parameters, VARIADIC '{7}'::text[]))::bigint = 248) AND ((jsonb_extract_path(parameters, VARIADIC '{8}'::text[]))::bigint = 412) AND ((jsonb_extract_path(parameters, VARIADIC '{5}'::text[]))::bigint = ANY ('{79,80,129,130,113}'::bigint[])))"
        Rows Removed by Filter: 28733
        Buffers: shared hit=4881
Planning Time: 0.451 ms
Execution Time: 64.973 ms

I see that idx_data_measure is working, and that's all...

Your select distinct query has to open up and expand the parameters object for every single row. The indexes you are creating on this table is far more work (and space used) than going with either a wide and sparse table or a related pair of tables. — Mike Organek
I've added pg version to post and also edited query to fit sample data. @a_horse_with_no_name it's for single values. But what to do with multiple values? Make it where parameters @> '...' OR parameters @>'...'? It will be very long query, because user can select any parameter set. — zhoriq
It seems the condition on measure_id is already good enough to make use of the index on that column, so you probably don't need any additional indexes. — a_horse_with_no_name
@a_horse_with_no_name it's good while I have 3M records on my dev machine, on the server it's more than 200M records and this index is not enough. — zhoriq
Then you should add the explain (analyze) output from the production server — a_horse_with_no_name

a_horse_with_no_name a_horse_with_no_name · Accepted Answer · 2020-07-18T12:32:14

With Postgres 12, I would try something like this:

create index on data (measure_id);
create index on data using gin (parameters);

The operators @> and @? can make use of a GIN index, so maybe this gives better performance.

select *
from data
where measure_id = 19 
  and parameters @? '$."5" ? (@ == 79 || @ == 80 || @ == 129 || @ == 130 || @ == 112)'
  and parameters @> '{"1": 12, "2": 59, "6": 249, "7": 248, "8": 412}'

With Postgres 11 you can't use the @? operator, so maybe splitting the condition up into one that uses the @> operator (to enable use of the GIN index) and the other using the IN condition. That will only be efficient if the condition using the @> operator is highly selective.

select *
from data
where measure_id = 19 
  and parameters @> '{"1": 12, "2": 59, "6": 249, "7": 248, "8": 412}'
  and (parameters ->> '5')::bigint in (79,80,129,130,112);

Note there is no need to use jsonb_extract_path(). I don't know, but maybe using ->> is faster.

The plan you posted shows the query that retrieves all rows from the table and also unnests all elements from the JSON. There is no index that will speed that up.

But maybe doing the unnesting only once, makes this faster:

SELECT DISTINCT
        d.measure_id,
        p.ky AS parameter_id,
        p.value AS parameter_value_id
FROM data d
  cross join jsonb_each(d.parameters) as p(ky, value)

But again: no index will help with that query, as you are retrieving all rows from the table.

Indexing jsonb column in Postgresql and IN clause

1 Answers