We have a simple table in Redshift that stores semi-structured time series data:
Table "public.system_log_lines"
Column | Type | Modifiers
--------------+-----------------------------+-----------
id | bigint | not null
received_at | timestamp without time zone | not null
program | character varying(64) |
message | character varying(65535) |
Indexes:
"system_log_lines_new_pkey" PRIMARY KEY, btree (id)
Its sortkey is received_at:
table | diststyle | sortkey1 | skew_sortkey1
------------------------+-----------+-------------+---------------
system_log_lines | EVEN | received_at | 60.26
One common question about this data: "what events happen soon after a particular kind of unusual event?".
Expressing this type of question in SQL is straightforward:
select
first_lines.received_at,
more_lines.*
from (
select
log.received_at,
log.program,
log.message
from system_log_lines as log
where
log.program = 'something.log'
and log.message like '%SOME INTERESTING STRING%'
) as first_lines
left join system_log_lines as more_lines on
more_lines.received_at between
first_lines.received_at
and first_lines.received_at + '1 minute'::interval
;
However, Redshift is incapable of executing this query successfully. It never terminates when running in practice. Here is the query plan produced:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------
XN Nested Loop Left Join DS_BCAST_INNER (cost=0.00..64217501937184.91 rows=2973033 width=237)
Join Filter: (("inner".received_at <= ("outer".received_at + '00:01:00'::interval)) AND ("inner".received_at >= "outer".received_at))
-> XN Seq Scan on system_log_lines log (cost=0.00..401359.38 rows=1 width=8)
Filter: (((message)::text ~~ '%SOME INTERESTING STRING%'::text) AND ((program)::text = 'something.log'::text))
-> XN Seq Scan on system_log_lines more_lines (cost=0.00..267572.92 rows=26757292 width=229)
----- Nested Loop Join in the query plan - review the join predicates to avoid Cartesian products -----
(6 rows)
The join is inefficient. That's not surprising given the BETWEEN condition. What is surprising is that the query still fails to terminate even if an explicit LIMIT 1 is applied to the subselect (i.e., there is only one "unusual event" to join against). Or even if no rows are returned by that subselect (i.e., there are no events to participate against).
That seems weird since the same query does terminate if the JOIN is removed, using literal received_at conditions, so it seems like even a simple nested loop join implementation (e.g., executing the joining loop at the application layer) could work acceptably.
Is there any way to structure this type of query such that Redshift can execute it successfully?