I have a stream of events that needs to be enriched with subscription information. Some events are broadcast events: when one is received, I need to query a database table, find all subscribers of the event (up to 10,000 rows in my use case), and transform the single broadcast event into 10,000 notification events. Normal events carry an additional user_id key that can be used to join the subscription table, so they don't have this issue.

The challenges are:

  • How to join a large ResultSet: pulling it all into memory doesn't seem like a scalable solution. Is there a way to partition this into many smaller parallel tasks?
  • How to organize the processing pipeline so that normal events and broadcast events don't interfere with each other. I don't want consecutive long-running broadcast events to block the processing pipeline for normal events.

I'm just getting started with Flink. What would be a correct and performant architecture for this use case? If needed, the broadcast event type and the normal event type can be separated into two sources.

1 Answer

Ideally, you can provide the secondary information (the database table) as an additional input to Flink and then simply use a join. That is only viable if the information can be fetched through a Flink connector. The advantage is that, done correctly, even updates to the table are reflected appropriately in the output. You also don't need to worry about the result size, as Flink handles that automatically.
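For example, with Flink SQL and the JDBC connector, the subscription table can be declared as a lookup table and joined directly against the event stream. The following is a minimal sketch: the datagen source stands in for your real event source, and the schemas, table names, and connection URL are placeholder assumptions.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class SubscriptionJoinJob {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        // Placeholder event source; in a real job this would be Kafka or similar.
        // The processing-time attribute proc_time is required by the lookup join.
        tEnv.executeSql(
            "CREATE TABLE events (" +
            "  event_id BIGINT," +
            "  event_type STRING," +
            "  proc_time AS PROCTIME()" +
            ") WITH ('connector' = 'datagen')");

        // Subscription table read through the JDBC connector (placeholder URL/schema).
        tEnv.executeSql(
            "CREATE TABLE subscriptions (" +
            "  event_type STRING," +
            "  subscriber_id BIGINT" +
            ") WITH (" +
            "  'connector' = 'jdbc'," +
            "  'url' = 'jdbc:postgresql://db-host:5432/mydb'," +
            "  'table-name' = 'subscriptions'" +
            ")");

        // The lookup join fans a single broadcast event out to one row per
        // matching subscriber; Flink manages the result size internally.
        tEnv.executeSql(
            "SELECT e.event_id, s.subscriber_id " +
            "FROM events AS e " +
            "JOIN subscriptions FOR SYSTEM_TIME AS OF e.proc_time AS s " +
            "ON e.event_type = s.event_type").print();
    }
}
```

A lookup join queries the table's current state per incoming event; if you instead ingest the table as a changelog stream (e.g. through a CDC connector) and use a regular join, updates to the table are also reflected in previously produced results.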

Alternatively, you can use asyncIO, which is specifically designed for interacting with external systems. The downside of asyncIO is that currently all results of all active requests have to fit into main memory. That should still be viable for 10,000 rows, especially since the broadcast events seem to occur rather rarely.
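A minimal sketch of the asyncIO approach, assuming a JDBC-reachable subscription table; the BroadcastEvent and Notification types, the connection URL, and the schema are all hypothetical stand-ins for your own:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

// Hypothetical event/result types for this sketch.
class BroadcastEvent {
    public String eventType;
}

class Notification {
    public String eventType;
    public long subscriberId;

    public Notification(String eventType, long subscriberId) {
        this.eventType = eventType;
        this.subscriberId = subscriberId;
    }
}

public class SubscriberLookup extends RichAsyncFunction<BroadcastEvent, Notification> {
    private transient Connection connection;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Placeholder connection; a production job would use an async or pooled client.
        connection = DriverManager.getConnection("jdbc:postgresql://db-host:5432/mydb");
    }

    @Override
    public void asyncInvoke(BroadcastEvent event, ResultFuture<Notification> resultFuture) {
        // Run the blocking JDBC query on a separate thread so asyncInvoke
        // returns immediately and other events keep flowing.
        CompletableFuture.supplyAsync(() -> {
            List<Notification> out = new ArrayList<>();
            try (PreparedStatement stmt = connection.prepareStatement(
                    "SELECT subscriber_id FROM subscriptions WHERE event_type = ?")) {
                stmt.setString(1, event.eventType);
                try (ResultSet rs = stmt.executeQuery()) {
                    // One broadcast event becomes one notification per subscriber.
                    while (rs.next()) {
                        out.add(new Notification(event.eventType, rs.getLong(1)));
                    }
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
            return out;
        }).whenComplete((notifications, err) -> {
            if (err != null) {
                resultFuture.completeExceptionally(err);
            } else {
                resultFuture.complete(notifications);
            }
        });
    }

    @Override
    public void close() throws Exception {
        if (connection != null) {
            connection.close();
        }
    }
}
```

The stream is then wired in with something like `AsyncDataStream.unorderedWait(broadcastEvents, new SubscriberLookup(), 30, TimeUnit.SECONDS, 10)`; the last (capacity) argument bounds how many requests, and therefore how many result sets, are in flight at once, which is what keeps the memory usage mentioned above in check.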