2
votes

I am planning to use nifi to ingest data from more than 10,000 sensors. There are 50-100 types of sensors which will send a specific metric to nifi.

I am pondering over whether I should assign 1 port number to listen to all the sensors or I should assign 1 port for each type of sensor to facilitate my data pipeline. which is the better option?

Is there a upper limit of the no of ports which I can "listen" using nifi?

2

2 Answers

2
votes

@ilovetolearn

NiFi is such a powerful tool. You can do either of your ideas, but I would recommend to do what is easier for you. If you have data source sensors that need different data flows, use different ports. However, if you can fire everything at a single port, I would do this. This makes it easier to implement, consistent, easier to support later, and easier to scale.

In large scale highly available NiFi, you may want a Load Balancer to handle the inbound data. This would push the sensor data toward a single host:port on the LB appliance, that then directs to NiFi with 3-5-10+ nodes.

1
votes

I agree with the other answer that once scaling comes into play, an external load balancer in front of NiFi would be helpful.

In regards to the flow design, I would suggest using a single exposed port to ingest all the data, and then use RouteOnAttribute or RouteOnContent processors to direct specific sensor inputs into different flow segments.

One of the strengths of NiFi is the generic nature of flows given sufficient parameterization, so taking advantage of flowfile attributes to handle different data types dynamically scales and performs better than duplicating a lot of flow segments to statically handle slightly differing data.

The performance overhead to run multiple ingestion ports vs. a single port and routed flowfiles is substantial, so this will give you a large performance improvement. You can also organize your flow segments into hierarchical nested groups using the Process Group features, to keep different flow segments cleanly organized and enforce access controls as well.

2020-06-02 Edit to answer questions in comments

Yes, you would have a lot of relationships coming out of the initial RouteOnAttribute processor at the ingestion port. However, you can segment these (route all flowfiles with X attribute in "family" X here, Y here, etc.) and send each to a different process group which encapsulates more specific logic.

Think of it like a physical network: at a large organization, you don't buy 1000 external network connections and hook each individual user's machine directly to the internet. Instead, you obtain one (plus redundancy/backup) large connection to the internet and use a router internally to direct the traffic to the appropriate endpoint. This has management benefits as well as cost, scalability, etc.

The overhead of multiple ingestion ports is that you have additional network requirements (S2S is very efficient when communicating, but there is overhead on a connection basis), multiple ports to be opened and monitored, and CPU to schedule & run each port's ingestion logic.

I've observed this pattern in practice at scale in multinational commercial and government organizations, and the performance improvement was significant when switching to a "single port; route flowfiles" pattern vs. "input port per flow" design. It is possible to accomplish what you want with either design, but I think this will be much more performant and easier to build & maintain.