What I'm trying to do
For a university project I am trying to compare the streaming performance (throughput, latency) of Apache Flink and Apache Kafka under different configurations (1 node, 2 nodes, 4 nodes, varying numbers of CPU cores, etc.).
For this purpose I have created a Twitter JSON dataset containing roughly 15,000 tweets, one tweet per line.
The Problem
As far as I know, Kafka follows the pattern "Producer - Kafka Cluster/Brokers - Consumer", and to benchmark latency I would measure the time between Producer and Consumer for each record.
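To make the per-record measurement concrete, here is a minimal sketch of what I have in mind, written in plain Python without any Kafka or Flink client. The field name `produced_at_ns` and the helper names `stamp`, `latency_ms` and `summarize` are my own placeholders, not part of any library or of the tweet schema. It also assumes the producer and consumer clocks are synchronized, which would need to be checked on a multi-node setup.

```python
import json
import time

# Hypothetical field name, added only for the benchmark; not part of the tweet schema.
SENT_FIELD = "produced_at_ns"


def stamp(line, now_ns=None):
    """Producer side: parse one newline-delimited tweet and attach the send time."""
    record = json.loads(line)
    record[SENT_FIELD] = time.time_ns() if now_ns is None else now_ns
    return json.dumps(record)


def latency_ms(message, now_ns=None):
    """Consumer side: time elapsed since the record was stamped, in milliseconds."""
    record = json.loads(message)
    received = time.time_ns() if now_ns is None else now_ns
    return (received - record[SENT_FIELD]) / 1_000_000


def summarize(latencies_ms):
    """Mean and nearest-rank percentiles over all measured records."""
    ordered = sorted(latencies_ms)

    def percentile(p):
        # Nearest-rank method: rank = ceil(n * p / 100), clamped to at least 1.
        rank = max(1, -(-len(ordered) * p // 100))
        return ordered[rank - 1]

    return {
        "count": len(ordered),
        "mean": sum(ordered) / len(ordered),
        "p50": percentile(50),
        "p99": percentile(99),
    }
```

The idea would be to call `stamp` on each line of the dataset just before it is sent, call `latency_ms` wherever the record comes out at the other end, and feed the collected values to `summarize`.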
The problem is that, as far as I can tell, Apache Flink has no counterpart to the Producer: it seems I have to specify a source for the data stream, which the TaskManagers ("Consumers") then fetch and process.
This makes it hard to tell how I should benchmark both systems in a comparable way. For Kafka I would measure the latency between Producer and Consumer, whereas for Flink I would have to measure the time between the JobManager and the TaskManagers, so the producer part would be missing.
Assuming I haven't misunderstood something, how would I measure both systems in a comparable way in order to draw reasonable conclusions?