
I'm trying to design an IoT system that will have many IoT devices sending different types of sensor data to a front-end load balancing server that then sends the messages to an ingestion system (currently thinking Google Cloud PubSub). Then there are consumers that consume the messages and write them to different databases and tables. Each sensor data type has its own database.

Where should fanout happen?

  1. BEFORE pubsub system: If the frontend does the fanout, then it has to be scaled big enough to have the processing power to inspect the content of each message and figure out which topic to send it to. I would then have a separate topic for each message type and a consumer for each topic.

  2. AFTER pubsub system: If I only have a single topic that the frontend shoves all messages into regardless of their type, then that topic's consumer needs to be scaled to consume and inspect every message to determine which database to write to. It would also mean that this one consumer needs access to all the databases.

  3. INSIDE pubsub system: Have pubsub do the fanout, so that even though publishers only publish to one topic, there are several subscriptions for that topic (one for each data type), and each consumer consumes from its own subscription and drops all the messages that are not the data type it is meant to consume. It seems like Kafka might be a better fit for this.
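With the fanout in option 3, each consumer keeps only the messages whose type matches its subscription and acknowledges everything else so it is not redelivered. A minimal sketch of that filter, assuming messages carry their data type in a `type` attribute (the attribute name and type values are assumptions, not anything from a real schema):

```python
MY_TYPE = "temperature"  # this consumer's data type (illustrative)

def on_message(attributes: dict, payload: bytes, process) -> bool:
    """Decide whether this consumer handles a delivered message.

    Returns True if the message was processed, False if it was
    dropped. Either way the caller should acknowledge the message,
    so that the subscription does not redeliver it.
    """
    if attributes.get("type") != MY_TYPE:
        return False  # drop: this message is meant for another consumer
    process(payload)
    return True
```

If available in your setup, Pub/Sub subscription filters on message attributes can do this same drop server-side, so each subscription only ever delivers its own data type and the consumer-side filtering (and its wasted throughput) goes away.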

#3 is a good fit for Kafka. There is a Kafka Connector for Google PubSub, so you can even use both and route PubSub messages into Kafka topics (or vice versa) without any coding. See github.com/GoogleCloudPlatform/pubsub/blob/master/… – Hans Jespersen

1 Answer


For connecting devices and routing data through Pub/Sub, you will be better served by IoT Core, which also handles device connectivity, authentication, and monitoring. Since configuring connectivity on the device side is more complex, it is better to fan out at the server. Right now your best option is to have a single consumer of the Pub/Sub topic that writes to each storage. If you want to decouple the consumers, you can have one per storage instead, but each will need to "drop" the messages it is not responsible for. Dataflow can be used for this with Apache Beam I/O Transforms.
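The single-consumer setup described above boils down to a dispatch table from data type to a per-database writer. A rough sketch of that routing, with in-memory lists standing in for the real databases (all the type names and writer functions here are made up for illustration):

```python
# In-memory stand-ins for the per-type databases (illustrative only).
stores = {"temperature": [], "humidity": []}

def write_temperature(record):
    stores["temperature"].append(record)

def write_humidity(record):
    stores["humidity"].append(record)

# The one consumer holds a writer (i.e. a database connection)
# for every data type it might receive.
WRITERS = {
    "temperature": write_temperature,
    "humidity": write_humidity,
}

def handle(message: dict) -> None:
    """Route one consumed message to the writer for its type."""
    writer = WRITERS.get(message.get("type"))
    if writer is None:
        return  # unknown type: drop it, or send to a dead-letter topic
    writer(message["payload"])
```

This is the coupling the question worries about: the one consumer process needs credentials for every database, which is exactly what splitting into per-subscription consumers (or one Dataflow pipeline per sink) avoids.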