I have a question about distributed tables in ClickHouse. Say I have two nodes running ClickHouse. Each node has a datatable with the ReplacingMergeTree engine (I know it doesn't guarantee full deduplication, and I'm OK with that), into which data flows from Kafka through a Kafka engine table (each node reads from its own topic). On each node a datatable_distributed table is also created. Now suppose, for some reason, the exact same message arrives in both Kafka topics. Do I understand correctly that, at the end of the day, a query against the distributed table will return two rows with that message, simply because the Distributed engine just reads from the two datatables on the different nodes and does no deduplication?
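For reference, a minimal sketch of the setup described above. All names (cluster, columns, ORDER BY key) are hypothetical, not taken from the question:

```sql
-- Per-node local table; ReplacingMergeTree deduplicates by the ORDER BY key,
-- but only within a partition and only when parts are merged.
CREATE TABLE datatable
(
    id UInt64,
    payload String,
    ts DateTime
)
ENGINE = ReplacingMergeTree
ORDER BY id;

-- Distributed table created on each node; a SELECT against it simply
-- fans out to datatable on every shard and concatenates the results.
CREATE TABLE datatable_distributed AS datatable
ENGINE = Distributed('my_cluster', currentDatabase(), 'datatable');
```

With this layout, the same message inserted on both nodes produces one row per node, and the Distributed table will return both.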
Look at this: How to avoid duplicates in clickhouse table?. At the moment, deduplication is a feature of the Replicated*MergeTree engines.
- vladimir
That doesn't answer my question. My question is: what happens if I have a Distributed engine on top of ReplacingMergeTree tables?
- Алексей
A Replacing* engine doesn't even guarantee 'eventual' deduplication, because duplicate rows can be stored in different partitions, which live independently of each other (it only assumes that duplicate rows in different parts of the same partition will eventually be merged and 'deduplicated'). A Distributed table just gathers data from the shards and doesn't deduplicate it. In your case you probably need to configure the Kafka consumer pipeline (i.e. the materialized view) so that duplicate rows end up on the same node.
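One way to illustrate that suggestion (a sketch, with hypothetical names; the answer itself doesn't prescribe a concrete scheme): if inserts go through a Distributed table whose sharding key is derived from the deduplication key, identical rows always land on the same shard, where ReplacingMergeTree can eventually collapse them. The FINAL modifier can then force deduplication at read time:

```sql
-- Shard inserts by a hash of the dedup key, so duplicates of the same id
-- are routed to the same node rather than one copy per node.
CREATE TABLE datatable_distributed AS datatable
ENGINE = Distributed('my_cluster', currentDatabase(), 'datatable', cityHash64(id));

-- FINAL merges duplicate rows at query time (correct here only because
-- all duplicates of a given id live on one shard); it adds query overhead.
SELECT * FROM datatable_distributed FINAL WHERE id = 42;
```

Without a sharding key like this (e.g. when each node ingests its own Kafka topic independently), duplicates can sit on different shards and neither merges nor FINAL will collapse them across shards.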
- vladimir