1 vote

I've got a Postgres 9.4.4 database with 1.7 million records. The following information is stored in a JSONB column called data in a table called accounts:

data: {
  "lastUpdatedTime": "2016-12-26T12:09:43.901Z",
  "UID": "2c5bb7fd-1a00-4988-8d92-ffaa52ebc20d",
  "data": {
    "country": "UK",
    "verified_at": "2017-01-01T23:49:10.217Z"
  }
}

The data format cannot be changed since this is legacy information.

I need to obtain all accounts where the country is UK, the verified_at value is not null and the lastUpdatedTime value is greater than some given value.

So far, I have the following query:

SELECT * FROM "accounts"
WHERE (data @> '{ "data": { "country": "UK" } }')
AND (data->'data' ? 'verified_at')
AND ((data->'data' ->> 'verified_at') is not null)
AND (data ->>'lastUpdatedTime' > '2016-02-28T05:49:08.511846')
ORDER BY data ->>'lastUpdatedTime' LIMIT 100 OFFSET 0;

And the following indexes:

"accounts_idxgin" gin (data)
"accounts_idxgin_on_data" gin ((data -> 'data'::text))

I've managed to get the query time down to about 1000 to 4000 ms.

Here is the EXPLAIN ANALYZE output for the query:

 Bitmap Heap Scan on accounts  (cost=41.31..6934.50 rows=9 width=1719)
                               (actual time=7.273..1067.657 rows=23190 loops=1)
   Recheck Cond: ((data -> 'data'::text) ? 'verified_at'::text)
   Filter: ((((data -> 'data'::text) ->> 'verified_at'::text) IS NOT NULL)
           AND ((data ->> 'lastUpdatedTime'::text) > '2016-02-01 05:49:08.511846'::text)
           AND (((data -> 'data'::text) ->> 'country'::text) = 'UK'::text))
   Rows Removed by Filter: 4
   Heap Blocks: exact=16039
   ->  Bitmap Index Scan on accounts_idxgin_on_data  (cost=0.00..41.30 rows=1773 width=0)
       (actual time=4.618..4.618 rows=23194 loops=1)
         Index Cond: ((data -> 'data'::text) ? 'verified_at'::text)
 Planning time: 0.448 ms
 Execution time: 1069.344 ms
(9 rows)

I have the following questions:

  1. Is there anything I can do to further speed up this query?
  2. What is the correct way to speed up a "field is not null" query with JSONB? I ended up using the existence operator (data->'data' ? 'verified_at') to filter out a large number of non-matching records, because much of my data doesn't have verified_at as a top-level key. This increased the speed of the query, but I'm wondering if there's a general approach to optimizing this type of query.
  3. In order to use the existence operator (data -> 'data' ? 'verified_at'), I needed to add another index on the expression (data -> 'data'). I already had a GIN index on data, but the existence operator didn't use it. Why is that? I thought the existence and containment operators would use this index.

3 Answers

3 votes

3: Not really. This case is explicitly mentioned in the docs: an index on the column data is only used when the query is written against that column, as in data @> '...' or data ? '...'. An index on the expression (data -> 'data') is used by queries written against that expression, such as (data -> 'data') @> '...' or (data -> 'data') ? '...'.
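
To spell the rule out, here is a minimal sketch using the two index definitions from the question, with the predicates each one can serve:

-- Index on the whole column:
create index accounts_idxgin on accounts using gin (data);
--   usable:     where data @> '{"data":{"country":"UK"}}'
--   not usable: where (data -> 'data') ? 'verified_at'

-- Index on the expression:
create index accounts_idxgin_on_data on accounts using gin ((data -> 'data'));
--   usable:     where (data -> 'data') @> '{"country":"UK"}'
--   usable:     where (data -> 'data') ? 'verified_at'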

2: The usual jsonb indexes won't help with a (jsonb_col -> '<key>') is [not] null predicate at all. Unfortunately, you cannot use jsonb_col @> '{"<key>":null}' either, because the JSON object might lack the key entirely, and reverse use of the index (for is not null) is not possible at all. But there may be a trick...
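
To see why, with throwaway literals: a JSON null and a missing key are distinct for -> and @>, but both come out as SQL NULL with ->>, which is what the is not null test relies on.

select '{"verified_at": null}'::jsonb ->  'verified_at' as arrow_json_null,   -- jsonb 'null', not SQL NULL
       '{}'::jsonb                    ->  'verified_at' as arrow_missing_key, -- SQL NULL
       '{"verified_at": null}'::jsonb ->> 'verified_at' as text_json_null,    -- SQL NULL
       '{}'::jsonb                    ->> 'verified_at' as text_missing_key,  -- SQL NULL
       '{}'::jsonb @> '{"verified_at": null}'           as contains_null_key; -- false: the key is absent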

1: Not much. There may be some improvements, but don't expect huge performance gains. Here they are:

You can use the jsonb_path_ops operator class instead of the (default) jsonb_ops. This should give a small performance improvement, but jsonb_path_ops indexes cannot support the existence operator (?). We won't need it anyway, though.

You have a single, index-unfriendly, boolean-typed expression, which slows you down. Thankfully, you can use a partial index here if you are only interested in rows where it is true.

So, your index should look something like this:

create index accounts_idxgin_on_data
  on accounts using gin ((data -> 'data') jsonb_path_ops)
  where (data -> 'data' ->> 'verified_at') is not null;

With this index, you can use the following query:

select   *
from     accounts
where    (data -> 'data') @> '{"country":"UK"}'
and      (data -> 'data' ->> 'verified_at') is not null
and      (data ->> 'lastUpdatedTime') > '2016-02-28T05:49:08.511Z'
order by data ->>'lastUpdatedTime';

Note: for proper timestamp comparisons, you should use (data ->> 'lastUpdatedTime')::timestamptz > '2016-02-28T05:49:08.511Z'.
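
Applied to the query above, that would look like the sketch below (only the lastUpdatedTime expressions change; casting in the ORDER BY as well keeps the sort chronological rather than lexical):

select   *
from     accounts
where    (data -> 'data') @> '{"country":"UK"}'
and      (data -> 'data' ->> 'verified_at') is not null
and      (data ->> 'lastUpdatedTime')::timestamptz > '2016-02-28T05:49:08.511Z'
order by (data ->> 'lastUpdatedTime')::timestamptz;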

http://rextester.com/QWUW41874
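
A quick way to confirm over time that the rebuilt index is actually being scanned (index name as created above):

select relname, indexrelname, idx_scan
from   pg_stat_user_indexes
where  indexrelname = 'accounts_idxgin_on_data';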

1 vote

Use the path access operator for faster access to lower-level objects:

SELECT * FROM "accounts"
WHERE data #>> '{data, country}' = 'UK'
  AND data #>> '{data, verified_at}' IS NOT NULL
  AND data ->> 'lastUpdatedTime' > '2016-02-28T05:49:08.511846'
ORDER BY data ->> 'lastUpdatedTime' LIMIT 100 OFFSET 0;

An index on a jsonb column only supports operators applied to that column as a whole. So, with an index on column data, queries like data @> '...' are supported. However, for a predicate like data -> 'data' ? 'verified_at' you need an index on the expression data -> 'data'.
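
For reference, #> extracts the value as jsonb while #>> extracts it as text, which is why the IS NOT NULL test above is written with #>> (throwaway literal):

SELECT '{"data":{"country":"UK"}}'::jsonb #>  '{data,country}' AS as_jsonb, -- "UK" (jsonb, keeps the quotes)
       '{"data":{"country":"UK"}}'::jsonb #>> '{data,country}' AS as_text;  -- UK   (text)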

Two more points:

  • I don't think it is necessary to test for the presence of verified_at. If it is not there, it simply comes out as NULL, so it gets caught by the same test.
  • Comparing string representations of timestamp values may work if the JSON values are properly and consistently formatted. Cast to timestamp to be on the safe side.
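
For instance, with mixed timezone offsets the lexical order and the chronological order can disagree (hypothetical values):

SELECT '2016-02-28T06:00:00+02:00' > '2016-02-28T05:00:00Z'                           AS text_cmp, -- true
       '2016-02-28T06:00:00+02:00'::timestamptz > '2016-02-28T05:00:00Z'::timestamptz AS ts_cmp;   -- false: 04:00 UTC < 05:00 UTC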

1 vote

After playing around a bit more, I've managed to reduce my query time from around 1000ms to 350ms by creating the following partial index:

 CREATE INDEX index_accounts_partial_on_verified_at
 ON accounts ((data->'data'->'verified_at'))
 WHERE (data->'data'->>'verified_at') IS NOT NULL
 AND (data->'data' ? 'verified_at')
 AND (data->'data'->>'country' = 'UK');

I was able to hardcode some of the values in this index, such as country = UK, because I only need to consider UK accounts for this query. I was also able to remove the index on ((data -> 'data')), which was 258 MB, and replace it with the partial index, which is only 1360 kB!
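
One way to compare the two sizes (while the old expression index still exists) is pg_relation_size, with the index names used above:

SELECT pg_size_pretty(pg_relation_size('index_accounts_partial_on_verified_at')) AS partial_index,
       pg_size_pretty(pg_relation_size('accounts_idxgin_on_data'))               AS old_expression_index;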

For anyone interested, I found the details for building a partial JSONB index here.