0 votes

I'm trying to store some reports in postgres as a jsonb field.

DDL:

CREATE TABLE reports (
    id INT, 
    report JSONB, 
    PRIMARY KEY (id)
)

Conceptually, a report has a structure like this:

{
    "metainfo": "infodata",
    "expense": {
        "rows": [
            {
                "item": "Repair",
                "cost": 15300.00,
                "ts": "2021-04-24"
            },
            {
                "item": "tractor",
                "cost": 120000.00,
                "ts": "2021-04-03"
            },
            ...
        ]
    }
}

The field set differs between reports, so not all of them have, for example, an "item" field.

Let's assume in our example we have an expense report for April 2021. Now I want to select all items from all reports where cost > 100000.00.

After generating 1 million reports, I found it takes ~30 seconds to extract this data. Is it possible to create a B-tree index that covers my case and accelerates my query:

select id, arr.item->'cost'
from reports, jsonb_array_elements(report->'expense'->'rows') arr(item)
where (arr.item->'cost')::numeric > 100000::numeric

EXPLAIN ANALYZE output (my table name is "jsonb1", not "reports"):

Nested Loop  (cost=0.01..4795009.70 rows=66000132 width=36) (actual time=132.281..170719.239 rows=1959507 loops=1)
->  Seq Scan on jsonb1  (cost=0.00..470001.04 rows=2000004 width=1790) (actual time=0.098..44013.831 rows=2000004 loops=1)
->  Function Scan on jsonb_array_elements arr  (cost=0.01..1.76 rows=33 width=32) (actual time=0.021..0.030 rows=1 loops=2000004)
        Filter: (((item -> 'cost'::text))::numeric > '100000'::numeric)
        Rows Removed by Filter: 0
Planning Time: 0.077 ms
JIT:
  Functions: 6
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 1.027 ms, Inlining 49.986 ms, Optimization 57.205 ms, Emission 23.538 ms, Total 131.756 ms
Execution Time: 186026.874 ms

I've tried other types of queries, e.g. jsonpath, but all of them lead to a seq scan. Index creation:

CREATE INDEX costbtree ON reports USING BTREE (((report->'expense'->'rows'->'cost')::numeric));
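For reference, a jsonpath formulation of the same filter (just a sketch, not necessarily the exact query I ran; it also ends in a seq scan) might look like:

select id, jsonb_path_query(report, '$.expense.rows[*].cost ? (@ > 100000)')
from reports
where report @@ '$.expense.rows[*].cost > 100000';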

Postgres version: PostgreSQL 12.2

It may be because of the statistics of these columns. The query planner may decide that it is faster and cheaper just to scan the table instead of using the proposed index. – Adam Tokarski
Please run this query with explain analyze and add the output to your question. – Adam Tokarski
Also, which version of Postgres are you using? – Adam Tokarski
@AdamTokarski, I've updated the question. I also added one level to the JSON structure, to be closer to the real example. – Kirill Bobrov
Your current query normalizes the data on the fly, which is why it can't really be indexed. If you normalize it right away, indexing would be quite simple. – a_horse_with_no_name

2 Answers

0 votes

You cannot index such a query in PostgreSQL. You could use a GIN index if you wanted to search for rows that have a cost of exactly 100000, but searching with > is not possible.
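A sketch of that equality-only approach (a GIN expression index with jsonb_path_ops and the @> containment operator; shown only as an illustration) could look like:

CREATE INDEX ON reports USING gin ((report->'expense'->'rows') jsonb_path_ops);

-- finds reports containing a row with cost exactly 100000; > cannot use this index
SELECT id
FROM reports
WHERE report->'expense'->'rows' @> '[{"cost": 100000}]';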

The underlying problem is your data model: if you don't store the data as JSON, but use a normalized data model where each of the rows is an actual table row, it would be quite simple to write such a query and index it for efficiency.
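A minimal sketch of such a normalized model (the table expense_rows and its columns are placeholder names) could look like:

-- one table row per expense row
CREATE TABLE expense_rows (
    report_id INT REFERENCES reports (id),
    item      TEXT,
    cost      NUMERIC,
    ts        DATE
);

CREATE INDEX ON expense_rows (cost);

-- the filter becomes a plain comparison on a regular column
SELECT report_id, cost
FROM expense_rows
WHERE cost > 100000;

With this layout the predicate is a simple indexable comparison, so the B-tree index on cost can be used directly.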

0 votes

You can index the maximum cost of each report to speed it up. You need to define a helper function first (it has to be IMMUTABLE so it can be used in an index expression):

CREATE OR REPLACE FUNCTION public.max(jsonb, text)
 RETURNS numeric
 LANGUAGE sql
 IMMUTABLE PARALLEL SAFE
AS $function$
    SELECT max((x->>$2)::numeric) from jsonb_array_elements($1) f(x) 
$function$;

create index on reports (max(report->'expense'->'rows','cost'));

select id, arr.row->'cost'
from reports, jsonb_array_elements(report->'expense'->'rows') arr(row)
where (arr.row->'cost')::numeric > 100000::numeric
  and max(report->'expense'->'rows','cost') > 100000;

You could hard-code the constants 'expense', 'rows', and 'cost' directly into the function if you wanted to make the usage more concise, at the expense of making the function less general.
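A sketch of that specialized variant (the name max_expense_cost is just a placeholder) might look like:

-- same idea as above, but with the JSON path baked in
CREATE OR REPLACE FUNCTION public.max_expense_cost(jsonb)
 RETURNS numeric
 LANGUAGE sql
 IMMUTABLE PARALLEL SAFE
AS $function$
    SELECT max((x->>'cost')::numeric)
    FROM jsonb_array_elements($1->'expense'->'rows') f(x)
$function$;

create index on reports (max_expense_cost(report));

The query's extra condition would then be max_expense_cost(report) > 100000 instead of the three-argument call.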