Apache Flink and Kafka Stream Benchmarking

Question

What I'm trying to do

For a university project I am trying to compare the streaming performance (throughput, latency) of Apache Flink and Kafka Streams using different configurations (1 node, 2 nodes, 4 nodes, varying numbers of CPU cores, etc.).

For this purpose I have created a Twitter JSON dataset containing ~15,000 tweets, each tweet delimited by a newline.

The Problem

As far as I know, Kafka follows the pattern "Producer - Kafka Cluster/Brokers - Consumer", and to benchmark latency I would measure the time between Producer and Consumer for each record.

The problem is that, as far as I can tell, Apache Flink has no equivalent of the Producer role: it seems I have to specify a source for the data stream, which the TaskManagers ("Consumers") would then fetch and process.

This makes it hard to tell how I should benchmark both systems in a comparable way. For the latency measurement in Kafka I would measure the time between Producer and Consumer, whereas in Flink I would have to measure the time between the JobManager and the TaskManagers, so the producer part would be missing.

Assuming I haven't misunderstood something, how would I measure both systems in a comparable way in order to make reasonable judgements?

David Anderson · Accepted Answer · 2021-04-15T17:26:35

What you could do with Flink would be to build this pipeline:

event generator -> input topic -> flink job -> output topic -> analysis

I would configure both Kafka producers (the one in the event generator and the one in Flink) to use log-append timestamps. I would also arrange for the Flink job to copy each incoming timestamp into a field of the corresponding output record, so that the events written to the output topic carry two timestamps: one generated by the Kafka broker for the input partition, and one generated by the broker handling the output partition. I would further arrange for the same broker to be used in both cases, so that both timestamps come from the same clock and are directly comparable.
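To make the analysis step concrete, here is a minimal sketch of how the latency could be computed once the output topic has been read back. It assumes the two topics use log-append timestamps (as far as I know this is selected with the topic-level setting `message.timestamp.type=LogAppendTime`), that each consumed record has been turned into a dict with a `timestamp` key holding the output record's broker timestamp in milliseconds, and that the job copied the input timestamp into a JSON field that I am calling `input_ts`. Those key names are hypothetical, not anything Flink or Kafka defines.

```python
import json
import math
import statistics


def latencies_ms(output_records):
    """Yield the per-event latency in milliseconds.

    Assumed (hypothetical) record shape:
      record["timestamp"] -> log-append time of the output record (ms)
      record["value"]     -> JSON string with an "input_ts" field holding
                             the log-append time of the input record (ms)
    """
    for record in output_records:
        payload = json.loads(record["value"])
        yield record["timestamp"] - payload["input_ts"]


def summarize(latencies):
    """Return count, min, median, 99th percentile and max of the latencies."""
    values = sorted(latencies)
    if not values:
        raise ValueError("no latencies to summarize")

    def percentile(p):
        # Nearest-rank percentile: smallest value with at least p% at or below it.
        rank = max(1, math.ceil(p / 100 * len(values)))
        return values[rank - 1]

    return {
        "count": len(values),
        "min": values[0],
        "median": statistics.median(values),
        "p99": percentile(99),
        "max": values[-1],
    }
```

Running the same analysis over the output topic of the Flink job and of the Kafka Streams application gives numbers that were measured the same way. Reporting a high percentile alongside the median is usually more informative than an average, since latency distributions tend to have long tails.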

You should be able to do pretty much the same thing for Kafka Streams.

BTW, 15,000 tweets will flow through this way too quickly to get meaningful results. I recommend implementing an event generator that can pump out arbitrarily long event streams.
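As a rough illustration of such a generator, the sketch below yields an unbounded stream of newline-free JSON events shaped loosely like tweets. The field names (`id`, `user`, `text`), the word list, and the function name `generate_events` are all made up for this example. Sending each line to the input topic is left out, since that depends on the Kafka client you choose.

```python
import itertools
import json
import random

# Small hypothetical vocabulary used to build synthetic tweet texts.
WORDS = ["flink", "kafka", "stream", "latency", "throughput", "benchmark",
         "broker", "topic", "event", "window"]


def generate_events(seed=42):
    """Yield an endless sequence of JSON-encoded synthetic events.

    A fixed seed makes the stream reproducible, so both systems under
    test can be fed exactly the same input.
    """
    rng = random.Random(seed)
    for event_id in itertools.count():
        text = " ".join(rng.choice(WORDS) for _ in range(rng.randint(3, 12)))
        event = {
            "id": event_id,
            "user": f"user{rng.randrange(1000)}",
            "text": text,
        }
        yield json.dumps(event)


if __name__ == "__main__":
    # Take as many events as the benchmark run needs; here just a few.
    for line in itertools.islice(generate_events(), 3):
        print(line)
```

Because the generator never terminates on its own, the length of a run is controlled by the caller, for example with `itertools.islice` or by stopping after a fixed wall-clock duration.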
