Redshift performance for simple time series data

Question

I am using an AWS Redshift table the holds information about invocations of functions. Each row has a date (of timestamp type), a UID (varchar), and several fields such as duration, error code. The size of the table is ~25 million rows of ~1000 different functions (each with a different UID).

My problem is that simple queries as a count of invocations of several functions in a time window take much time - usually 5-30 seconds.

I have tried different combinations of sort keys and dist key, but the performance seems to remain quite similar:

Setting the function UID as dist key
Setting a compound sort key of the date, the function UID and a combination of both in any order.

I have run VACUUM and ANALYZE on the table. I also tried to add/remove columns compression.

I am using only a single dc2.large node.

EDIT:

The table DDL is:

create table public."invocations_metrics_$mig"(
"function_uid" varchar(256)  NOT NULL encode RAW DISTKEY
,"date" timestamp    encode zstd 
,"duration" double precision   encode zstd 
,"used_memory" integer   encode zstd 
,"error" smallint   encode zstd 
,"has_early_exit" boolean   encode zstd 
,"request_id" varchar(256)   encode zstd 
)
 SORTKEY(date,function_uid);

An example of a row:

"aca500c9-27cc-47f8-a98f-ef71cbc7c0ef","2018-08-15 13:43:28.718",0.17,27,0,false,"30ee84e1-a091-11e8-ba47-b110721c41bc"

The query:

SELECT
    count(invocations_metrics_backup.function_uid) AS invocations,
    max(invocations_metrics_backup.date) AS last_invocation,
    invocations_metrics_backup.function_uid AS uid
FROM
    invocations_metrics_backup
WHERE
    function_uid IN (
        <10 UIDs>
    )
    AND DATE >= '2018-08-20T10:55:20.222812'::TIMESTAMP
GROUP BY
    function_uid

Total time is 5 seconds. The count in each query is ~5000. For the same query with a ~1M count it takes 30 seconds.

please show some example data and the exact sql you are using. together with the ddl for the table. also the exact timing for that query (the second run - not the first) — Jon Scott
Edited. I've disabled session cache for the example (otherwise the second query is very fast). Thanks! — Ronyis
your dist key may be causing the issue. try using "diststyle even" and without a sort key ans post the timings for that. — Jon Scott
There was an improvement of about 20%. It's nice but I think that this query should take 90% less time. — Ronyis

Joe Harris Joe Harris · Accepted Answer · 2018-09-07T13:18:53

First, you need to use at least 2 nodes. A single node has to do double duty as leader and compute. With 2 or more nodes you get a free leader node.

Then, change your DDL as follows, removing compression on the sort key:

CREATE TABLE public."invocations_metrics_$mig" (
    "function_uid"   varchar(256) NOT NULL ENCODE ZSTD,
    "date"           timestamp             ENCODE RAW,
    "duration"       double precision      ENCODE ZSTD,
    "used_memory"    integer               ENCODE ZSTD,
    "error"          smallint              ENCODE ZSTD,
    "has_early_exit" boolean               ENCODE ZSTD,
    "request_id"     varchar(256)          ENCODE ZSTD
) 
DISTSTYLE KEY 
DISTKEY( function_uid )
SORTKEY ( date )
;

You may also improve performance by mapping unique UIDs to an integer ID value and using that in your queries. UID values are quite inefficient to work with. The values occur randomly and are relatively wide with very high entropy. They are expensive during sorts, hash aggregations, and hash joins.

Redshift performance for simple time series data

1 Answers