I'm consuming a stream from a Kafka source in my Flink job, reading from 50 topics at once, like this:
import java.util.regex.Pattern;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

FlinkKafkaConsumer<GenericRecord> kafkaConsumer = new FlinkKafkaConsumer<>(
        // subscribe to all 50 topics; note [1-50] would match only a single character,
        // so the numeric range 1-50 is spelled out explicitly
        Pattern.compile("TOPIC_NAME([1-9]|[1-4][0-9]|50)\\.stream"),
        deserializationSchema, // Avro-based DeserializationSchema<GenericRecord>
        properties); // auto.commit.interval.ms=1000 ...
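
For completeness, properties is built roughly like this (broker list and group id are placeholders, not my real values):

import java.util.Properties;

Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "<BROKER_LIST>");  // placeholder
properties.setProperty("group.id", "<GROUP_ID>");              // placeholder
properties.setProperty("auto.commit.interval.ms", "1000");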
Downstream there are a handful of operators: filter -> map -> keyBy -> window -> aggregate -> sink.
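
Concretely, the job is wired up roughly like this (the key field, window size, aggregate function, and sink below are placeholders, not my exact code):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.addSource(kafkaConsumer)
    .filter(r -> r != null)                                      // drop bad records
    .map(r -> r)                                                 // light per-record transformation
    .keyBy(r -> r.get("key").toString())                         // placeholder key field
    .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))  // placeholder window
    .aggregate(new MyAggregateFunction())                        // placeholder AggregateFunction
    .addSink(sink);                                              // placeholder sink
env.execute();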
The maximum throughput I'm able to get is 10k to 20k records per second, which is pretty low given that the source publishes hundreds of thousands of events, and I can clearly see the consumer lagging behind the producer. I've even removed the sink and the other operators to rule out backpressure, but throughput stays the same. I'm deploying the application to Amazon Kinesis Data Analytics and have tried several parallelism settings, but none of them improve throughput.
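
For reference, this is roughly how I've been varying parallelism (example values only):

// In code: operator-level parallelism on the source.
env.addSource(kafkaConsumer).setParallelism(50); // example value; tried several

// In Kinesis Data Analytics: also tried different combinations of the
// application-level Parallelism and ParallelismPerKPU settings.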
Anything I'm missing?