1
votes

We have almost 1B records in a replicated merge tree table.

  1. The primary key is a,b,c
  2. Our App keeps writing into this table with every user action. (we accumulate almost a million records per hour)
  3. We append (store) the latest timestamp (updated_at) for a given unique combination of (a,b)

The key requirement is to provide a roll-up against the latest timestamp for a given combination of a,b,c

Currently, we are processing the queries as

select a,b,c, sum(x), sum(y)...etc
from table_1
where (a,b,updated_at) in (select a,b,max(updated_at) from table_1 group by a,b)
and c in (...)
group by a,b,c

clarification on the sub-query

(select a,b,max(updated_at) from table_1 group by a,b)

^ This part is for illustration only.. our app writes latest updated_at for every a,b implying that the clause shown above is more like 

(select a,b,updated_at from tab_1_summary)

[where tab_1_summary has latest record for a given a,b]

Note: We have to keep the grouping criteria as-is.

The table is structured with partition (c) order by (a, b, updated_at)

Question is, is there a way to write a better query. (that can returns results faster..we are required to shave off few seconds from the overall processing)

FYI: We toyed working with Materialized View ReplicatedReplacingMergeTree. But, given the size of this table, and constant inserts + the FINAL clause doesn't necessarily work well as compared to the query above.

Thanks in advance!

1

1 Answers

1
votes

Just for test try to use join instead of tuple in (tuples):

select t.a, t.b, t.c, sum(x), sum(y)...etc 
from table_1 AS t inner join tab_1_summary using (a, b, updated_at) 
where c in (...) 
group by t.a, t.b, t.c

Consider using AggregatingMergeTree to pre-calculate result metrics:

CREATE MATERIALIZED VIEW table_1_mv
ENGINE = AggregatingMergeTree() 
PARTITION BY toYYYYMM(updated_at) 
ORDER BY (updated_at, a, b, c)
AS SELECT
    updated_at,
    a,b,c, 
    sum(x) AS x, /* see [SimpleAggregateFunction data type](https://clickhouse.tech/docs/en/sql-reference/data-types/simpleaggregatefunction/) */
    sum(y) AS y,
    /* For non-simple functions should be used [AggregateFunction data type](https://clickhouse.tech/docs/en/sql-reference/data-types/aggregatefunction/). */
    // etc..
FROM table_1 
GROUP BY updated_at, a, b, c;

And use this way to get result:

select a,b,c, sum(x), sum(y)...etc
from table_1_mv
where (updated_at,a,b) in (select updated_at,a,b from tab_1_summary)
    and c in (...)    
group by a,b,c