1 vote

I've got a Postgres 9.4.4 database with 1.7 million records. The following information is stored in a JSONB column called data in a table called accounts:

data: {
  "lastUpdatedTime": "2016-12-26T12:09:43.901Z",
  "UID": "2c5bb7fd-1a00-4988-8d92-ffaa52ebc20d",
  "data": {
    "country": "UK",
    "verified_at": "2017-01-01T23:49:10.217Z"
  }
}

The data format cannot be changed since this is legacy information.

I need to obtain all accounts where the country is UK, the verified_at value is not null and the lastUpdatedTime value is greater than some given value.

So far, I have the following query:

SELECT * FROM "accounts"
WHERE (data @> '{ "data": { "country": "UK" } }')
AND (data->'data' ? 'verified_at')
AND ((data->'data' ->> 'verified_at') is not null)
AND (data ->>'lastUpdatedTime' > '2016-02-28T05:49:08.511846')
ORDER BY data ->>'lastUpdatedTime' LIMIT 100 OFFSET 0;

And the following indexes:

"accounts_idxgin" gin (data)
"accounts_idxgin_on_data" gin ((data -> 'data'::text))

I've managed to get the query time down to about 1000 to 4000 ms.

Here is the EXPLAIN ANALYZE output for the query:

 Bitmap Heap Scan on accounts  (cost=41.31..6934.50 rows=9 width=1719)
                               (actual time=7.273..1067.657 rows=23190 loops=1)
   Recheck Cond: ((data -> 'data'::text) ? 'verified_at'::text)
   Filter: ((((data -> 'data'::text) ->> 'verified_at'::text) IS NOT NULL)
           AND ((data ->> 'lastUpdatedTime'::text) > '2016-02-01 05:49:08.511846'::text)
           AND (((data -> 'data'::text) ->> 'country'::text) = 'UK'::text))
   Rows Removed by Filter: 4
   Heap Blocks: exact=16039
   ->  Bitmap Index Scan on accounts_idxgin_on_data  (cost=0.00..41.30 rows=1773 width=0)
       (actual time=4.618..4.618 rows=23194 loops=1)
         Index Cond: ((data -> 'data'::text) ? 'verified_at'::text)
 Planning time: 0.448 ms
 Execution time: 1069.344 ms
(9 rows)

I have the following questions:

  1. Is there anything I can do to further speed up this query?
  2. What is the correct way to speed up a "field is not null" query with JSONB? I ended up using the existence operator (data->'data' ? 'verified_at') to filter out a large number of non-matching records, because much of my data doesn't have verified_at as a top-level key. This increased the speed of the query, but I'm wondering if there's a general approach to optimizing this type of query.
  3. In order to use the existence operator (data -> 'data' ? 'verified_at'), I needed to add another index on the expression (data -> 'data'). I already had a GIN index on data, but the existence operator didn't use it. Why is that? I thought the existence and containment operators would use this index.

3 Answers

3 votes

3: Not really. This case is explicitly mentioned in the docs: an index on the column data is only used when the query is written against that column, as in data @> '...' or data ? '...'. An index on the expression (data -> 'data') is used by queries written against that expression, such as (data -> 'data') @> '...' or (data -> 'data') ? '...'.
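
To spell the rule out, here is a minimal sketch using the two index definitions from the question, with the predicates each one can serve:

-- Index on the whole column:
create index accounts_idxgin on accounts using gin (data);
--   usable:     where data @> '{"data":{"country":"UK"}}'
--   not usable: where (data -> 'data') ? 'verified_at'

-- Index on the expression:
create index accounts_idxgin_on_data on accounts using gin ((data -> 'data'));
--   usable:     where (data -> 'data') @> '{"country":"UK"}'
--   usable:     where (data -> 'data') ? 'verified_at'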

2: The usual jsonb indexes won't help with a (jsonb_col -> '<key>') is [not] null predicate at all. Unfortunately, you cannot use jsonb_col @> '{"<key>":null}' either, because the JSON object might lack the key entirely, and reverse use of the index (for is not null) is not possible at all. But there may be a trick...
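
To see why, with throwaway literals: a JSON null and a missing key are distinct for -> and @>, but both come out as SQL NULL with ->>, which is what the is not null test relies on.

select '{"verified_at": null}'::jsonb ->  'verified_at' as arrow_json_null,   -- jsonb 'null', not SQL NULL
       '{}'::jsonb                    ->  'verified_at' as arrow_missing_key, -- SQL NULL
       '{"verified_at": null}'::jsonb ->> 'verified_at' as text_json_null,    -- SQL NULL
       '{}'::jsonb                    ->> 'verified_at' as text_missing_key,  -- SQL NULL
       '{}'::jsonb @> '{"verified_at": null}'           as contains_null_key; -- false: the key is absent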

1: Not much. There may be some improvements, but don't expect huge performance gains. Here they are:

You can use the jsonb_path_ops operator class instead of the (default) jsonb_ops. This should give a small performance improvement, but jsonb_path_ops indexes cannot support the existence operator (?). We won't need it anyway, though.

You have a single, index-unfriendly, boolean-typed expression, which slows you down. Thankfully, you can use a partial index here if you are only interested in rows where it is true.

So, your index should look something like this:

create index accounts_idxgin_on_data
  on accounts using gin ((data -> 'data') jsonb_path_ops)
  where (data -> 'data' ->> 'verified_at') is not null;

With this index, you can use the following query:

select   *
from     accounts
where    (data -> 'data') @> '{"country":"UK"}'
and      (data -> 'data' ->> 'verified_at') is not null
and      (data ->> 'lastUpdatedTime') > '2016-02-28T05:49:08.511Z'
order by data ->>'lastUpdatedTime';

Note: for proper timestamp comparisons, you should use (data ->> 'lastUpdatedTime')::timestamptz > '2016-02-28T05:49:08.511Z'.
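
Applied to the query above, that would look like the sketch below (only the lastUpdatedTime expressions change; casting in the ORDER BY as well keeps the sort chronological rather than lexical):

select   *
from     accounts
where    (data -> 'data') @> '{"country":"UK"}'
and      (data -> 'data' ->> 'verified_at') is not null
and      (data ->> 'lastUpdatedTime')::timestamptz > '2016-02-28T05:49:08.511Z'
order by (data ->> 'lastUpdatedTime')::timestamptz;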

http://rextester.com/QWUW41874
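
A quick way to confirm over time that the rebuilt index is actually being scanned (index name as created above):

select relname, indexrelname, idx_scan
from   pg_stat_user_indexes
where  indexrelname = 'accounts_idxgin_on_data';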

1 vote

Use the path access operator for faster access to lower-level objects:

SELECT * FROM "accounts"
WHERE data #>> '{data, country}' = 'UK'
  AND data #>> '{data, verified_at}' IS NOT NULL
  AND data ->> 'lastUpdatedTime' > '2016-02-28T05:49:08.511846'
ORDER BY data ->> 'lastUpdatedTime' LIMIT 100 OFFSET 0;

An index on a jsonb column only supports operators applied to that column as a whole. So, with an index on column data, queries like data @> '...' are supported. However, for a predicate like data -> 'data' ? 'verified_at' you need an index on the expression data -> 'data'.
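
For reference, #> extracts the value as jsonb while #>> extracts it as text, which is why the IS NOT NULL test above is written with #>> (throwaway literal):

SELECT '{"data":{"country":"UK"}}'::jsonb #>  '{data,country}' AS as_jsonb, -- "UK" (jsonb, keeps the quotes)
       '{"data":{"country":"UK"}}'::jsonb #>> '{data,country}' AS as_text;  -- UK   (text)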

Two more points:

  • I don't think it is necessary to test for the presence of verified_at. If it is not there, it simply comes out as NULL, so it gets caught by the same test.
  • Comparing string representations of timestamp values may work if the JSON values are properly and consistently formatted. Cast to timestamp to be on the safe side.
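
For instance, with mixed timezone offsets the lexical order and the chronological order can disagree (hypothetical values):

SELECT '2016-02-28T06:00:00+02:00' > '2016-02-28T05:00:00Z'                           AS text_cmp, -- true
       '2016-02-28T06:00:00+02:00'::timestamptz > '2016-02-28T05:00:00Z'::timestamptz AS ts_cmp;   -- false: 04:00 UTC < 05:00 UTC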

1 vote

After playing around a bit more, I've managed to reduce my query time from around 1000ms to 350ms by creating the following partial index:

 CREATE INDEX index_accounts_partial_on_verified_at
 ON accounts ((data->'data'->'verified_at'))
 WHERE (data->'data'->>'verified_at') IS NOT NULL
 AND (data->'data' ? 'verified_at')
 AND (data->'data'->>'country' = 'UK');

I was able to hardcode some of the values in this index, such as country = UK, because I only need to consider UK accounts for this query. I was also able to remove the index on ((data -> 'data')), which was 258 MB, and replace it with the partial index, which is only 1360 kB!
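
One way to compare the two sizes (while the old expression index still exists) is pg_relation_size, with the index names used above:

SELECT pg_size_pretty(pg_relation_size('index_accounts_partial_on_verified_at')) AS partial_index,
       pg_size_pretty(pg_relation_size('accounts_idxgin_on_data'))               AS old_expression_index;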

For anyone interested, I found the details for building a partial JSONB index here.