Joining related log events in Redshift?

Question

We have a simple table in Redshift that stores semi-structured time series data:

            Table "public.system_log_lines"
    Column    |            Type             | Modifiers 
--------------+-----------------------------+-----------
 id           | bigint                      | not null
 received_at  | timestamp without time zone | not null
 program      | character varying(64)       | 
 message      | character varying(65535)    | 
Indexes:
    "system_log_lines_new_pkey" PRIMARY KEY, btree (id)

Its sortkey is received_at:

         table          | diststyle |  sortkey1   | skew_sortkey1 
------------------------+-----------+-------------+---------------
 system_log_lines       | EVEN      | received_at |         60.26

One common question about this data: "what events happen soon after a particular kind of unusual event?".

Expressing this type of question in SQL is straightforward:

select
    first_lines.received_at,
    more_lines.*
from (
    select
        log.received_at,
        log.program,
        log.message
    from system_log_lines as log
    where
        log.program = 'something.log'
        and log.message like '%SOME INTERESTING STRING%'
) as first_lines
left join system_log_lines as more_lines on
    more_lines.received_at between
        first_lines.received_at
        and first_lines.received_at + '1 minute'::interval
;

However, Redshift is incapable of executing this query successfully. It never terminates when running in practice. Here is the query plan produced:

                                                               QUERY PLAN                                                                
-----------------------------------------------------------------------------------------------------------------------------------------
 XN Nested Loop Left Join DS_BCAST_INNER  (cost=0.00..64217501937184.91 rows=2973033 width=237)
   Join Filter: (("inner".received_at <= ("outer".received_at + '00:01:00'::interval)) AND ("inner".received_at >= "outer".received_at))
   ->  XN Seq Scan on system_log_lines log  (cost=0.00..401359.38 rows=1 width=8)
         Filter: (((message)::text ~~ '%SOME INTERESTING STRING%'::text) AND ((program)::text = 'something.log'::text))
   ->  XN Seq Scan on system_log_lines more_lines  (cost=0.00..267572.92 rows=26757292 width=229)
 ----- Nested Loop Join in the query plan - review the join predicates to avoid Cartesian products -----
(6 rows)

The join is inefficient. That's not surprising given the BETWEEN condition. What is surprising is that the query still fails to terminate even if an explicit LIMIT 1 is applied to the subselect (i.e., there is only one "unusual event" to join against). Or even if no rows are returned by that subselect (i.e., there are no events to participate against).

That seems weird since the same query does terminate if the JOIN is removed, using literal received_at conditions, so it seems like even a simple nested loop join implementation (e.g., executing the joining loop at the application layer) could work acceptably.

Is there any way to structure this type of query such that Redshift can execute it successfully?

Gordon Linoff Gordon Linoff · Accepted Answer · 2017-08-19T19:07:05

You can use lag():

select l.*
from (select l.*,
             lag(l.program) over (order by received_at) as prev_program,
             lag(l.message) over (order by received_at) as prev_message,
             lag(l.received_at) over (order by received_at) as prev_received_at
      from from system_log_lines l
     ) l
where l.prev_program = 'something.log' and
      l.prev_message like '%SOME INTERESTING STRING%';

Window functions should be much faster than your join. You may want additional logic to capture the time dependency.

Joining related log events in Redshift?

1 Answers