I have a question about distributed tables in ClickHouse. Say I have two nodes running ClickHouse. Each node has a datatable with the ReplacingMergeTree engine (I know it doesn't guarantee full deduplication, and I'm OK with that), into which data flows from Kafka through a Kafka engine table (each node reads from its own topic). On each node a datatable_distributed table is also created. Now suppose, for some reason, the exact same message arrives in both Kafka topics. Do I understand correctly that, at the end of the day, a query against the distributed table will return two rows with that message, simply because the Distributed engine just reads from the two datatables on the different nodes and does no deduplication?
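For reference, a minimal sketch of the setup described above. All names (cluster, columns, ORDER BY key) are hypothetical, not taken from the question:

```sql
-- Per-node local table; ReplacingMergeTree deduplicates by the ORDER BY key,
-- but only within a partition and only when parts are merged.
CREATE TABLE datatable
(
    id UInt64,
    payload String,
    ts DateTime
)
ENGINE = ReplacingMergeTree
ORDER BY id;

-- Distributed table created on each node; a SELECT against it simply
-- fans out to datatable on every shard and concatenates the results.
CREATE TABLE datatable_distributed AS datatable
ENGINE = Distributed('my_cluster', currentDatabase(), 'datatable');
```

With this layout, the same message inserted on both nodes produces one row per node, and the Distributed table will return both.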
Look at this: How to avoid duplicates in clickhouse table?. At the moment, deduplication is a feature of the Replicated*MergeTree engines.
- vladimir
That doesn't answer my question. My question is: what happens if I have a Distributed engine on top of ReplacingMergeTree tables?
- Алексей
A Replacing* engine doesn't even guarantee 'eventual' deduplication, because duplicate rows can be stored in different partitions, which live independently of each other (it only assumes that duplicate rows in different parts of the same partition will eventually be merged and 'deduplicated'). A Distributed table just gathers data from the shards and doesn't deduplicate it. In your case you probably need to configure the Kafka consumer pipeline (i.e. the materialized view) so that duplicate rows end up on the same node.
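One way to illustrate that suggestion (a sketch, with hypothetical names; the answer itself doesn't prescribe a concrete scheme): if inserts go through a Distributed table whose sharding key is derived from the deduplication key, identical rows always land on the same shard, where ReplacingMergeTree can eventually collapse them. The FINAL modifier can then force deduplication at read time:

```sql
-- Shard inserts by a hash of the dedup key, so duplicates of the same id
-- are routed to the same node rather than one copy per node.
CREATE TABLE datatable_distributed AS datatable
ENGINE = Distributed('my_cluster', currentDatabase(), 'datatable', cityHash64(id));

-- FINAL merges duplicate rows at query time (correct here only because
-- all duplicates of a given id live on one shard); it adds query overhead.
SELECT * FROM datatable_distributed FINAL WHERE id = 42;
```

Without a sharding key like this (e.g. when each node ingests its own Kafka topic independently), duplicates can sit on different shards and neither merges nor FINAL will collapse them across shards.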
- vladimir