2 votes

There are best practices out there that recommend running MirrorMaker on the target cluster: https://community.hortonworks.com/articles/79891/kafka-mirror-maker-best-practices.html

I wonder why this recommendation exists, because ultimately all data must cross the boundary between the clusters, regardless of whether it is consumed at the target or produced at the source. One reason I can imagine is that MirrorMaker supports multiple consumers but only one producer, so the consuming side, which sits on the higher-latency link, might be sped up by using multiple consumers.

If multi-threaded performance is the point, would it be useful to use several producers (one per consumer) to replicate the data with a custom replication process? Does anyone know why MirrorMaker shares a single producer among all consumers?
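For illustration only, a custom replication process with one producer per consumer could look roughly like the sketch below. This is a minimal sketch, not MirrorMaker's actual implementation: the class name `ReplicationWorker` is made up, `consumerProps` is assumed to point at the source cluster (with a `group.id` and byte-array deserializers) and `producerProps` at the target cluster (with byte-array serializers).

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

// Hypothetical worker: one consumer reading from the source cluster
// paired with its own producer writing to the target cluster.
public class ReplicationWorker implements Runnable {

    private final KafkaConsumer<byte[], byte[]> consumer;
    private final KafkaProducer<byte[], byte[]> producer;
    private final String topic;

    public ReplicationWorker(Properties consumerProps, Properties producerProps, String topic) {
        this.consumer = new KafkaConsumer<>(consumerProps); // source cluster
        this.producer = new KafkaProducer<>(producerProps); // target cluster
        this.topic = topic;
    }

    @Override
    public void run() {
        consumer.subscribe(Collections.singletonList(topic));
        while (true) {
            ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<byte[], byte[]> record : records) {
                // Keep the source partition so per-partition ordering is preserved
                // (assumes the target topic has at least as many partitions).
                producer.send(new ProducerRecord<>(topic, record.partition(), record.key(), record.value()));
            }
            producer.flush();      // wait for acks from the target cluster
            consumer.commitSync(); // only then mark these offsets as replicated (at-least-once)
        }
    }
}
```

Each mirrored topic (or group of partitions) would then get its own `ReplicationWorker` thread, i.e. exactly the one-producer-per-consumer variation asked about above.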

My use case is the replication of data from several source clusters (~10) to a single target cluster. I would prefer to run the replication process on the source clusters to avoid too many replication processes (one per source) on the target cluster.

Hints and suggestions on this topic are very welcome.

I believe the single producer aims to avoid disordered replication, but it's just a guess. – aran
As I understand it, ordering is guaranteed per partition, and as a consumer is bound to a specific partition, why should a corresponding producer disorder the records at the target? I need to think about it... Maybe rebalancing could be a point of failure... – FrVaBe
It is guaranteed per partition, but only on the consumer side. Two producers may produce to the same partition (and that's usual). If your original topic was filled by, for example, two different producers, each with its own thread, launching the same two producers again won't make a perfect copy 99% of the time. I believe that to make sure you are producing in the same order, you must consume in order (which Kafka guarantees per partition) and launch just one producer, acting as a "sequential" copier. That's my guess though! :) I may be 100% incorrect. – aran

1 Answer

2 votes

I also posted the question to the Apache Kafka mailing list:
https://lists.apache.org/thread.html/06a3c3ec10e4c44695ad0536240450919843824fab206ae3f390a7b8@%3Cusers.kafka.apache.org%3E

I would like to quote some reasonable answers here:

Franz, you can run MM on or near either the source or the target cluster, but it's more efficient near the target because this minimizes producer latency. If latency is high, producers will block waiting on ACKs for in-flight records, which reduces throughput.

I recommend running MM near the target cluster, but not necessarily on the same machines, because Kafka nodes are often relatively expensive, with SSD arrays, huge IO bandwidth, etc., which isn't necessary for MM.

Ryanne

and

Hi, Franz!

I guess one of the reasons could be additional safety in case of a network split.

There is also some probability of bugs, even with good software. So, if we place MM on the source cluster and the network splits, consumers could (theoretically) continue to read messages from the source cluster and commit them even without acks from the destination cluster (one possible bug). This way you would end up with lost messages on the producer side after the network is fixed.

On the other hand, if we place MM on the destination cluster and the network splits, nothing bad happens. MM will be unable to grab data from the source cluster, so your data won't be corrupted even in case of bugs.

Tolya
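To make Ryanne's point about producer latency more concrete, below is a small, assumed sketch of producer settings a replication process running next to the target cluster might use. The class name `TargetSideProducerConfig`, the bootstrap address parameter and the concrete values are illustrative assumptions, not MirrorMaker defaults.

```java
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;

import java.util.Properties;

// Hypothetical producer configuration for a mirroring process placed near the target cluster.
public class TargetSideProducerConfig {

    public static Properties build(String targetBootstrapServers) {
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, targetBootstrapServers);
        p.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        p.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class.getName());
        // acks=all makes the copy durable; with the producer sitting next to the
        // target brokers the ack round trip is short, so in-flight requests drain quickly.
        p.put(ProducerConfig.ACKS_CONFIG, "all");
        // Batching and compression reduce the number and size of produce requests
        // sent to the target brokers.
        p.put(ProducerConfig.LINGER_MS_CONFIG, 50);
        p.put(ProducerConfig.BATCH_SIZE_CONFIG, 262144);
        p.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");
        return p;
    }
}
```

With the same settings but the producer running on the source side, every acks=all round trip would have to cross the WAN, which is exactly the in-flight blocking Ryanne describes.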