Google Cloud DataFlow for NRT data application

Question

I'm evaluating Kafka/Spark/HDFS for developing a near-real-time (NRT, sub-second) Java application that receives data from an external gateway and publishes it, by topic, to desktop/mobile clients (consumers). At the same time, the data will be fed through streaming and batch (persistent) pipelines for analytics and ML.

For example the flow would be...

A standalone TCP client reads streaming data from an external TCP server.
The client publishes the data to different topics based on the packet contents (Kafka) and passes it to the streaming pipeline for analytics (Spark).
A desktop/mobile consumer app subscribes to various topics and receives NRT data events (Kafka).
The consumer also receives analytics from the streaming/batch pipelines (Spark).
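The "publish to different topics based on the packets" step above could be sketched roughly as follows. This is only an illustration: the packet layout (a one-byte type code at the start of each packet), the type codes, and the topic names `quotes` and `trades` are all assumptions, since the question does not describe the gateway's wire format.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical router: picks a topic name from the first byte of a packet. */
final class TopicRouter {

    /**
     * Returns the topic a packet should be published to.
     * The type codes and topic names here are assumed for illustration only.
     */
    static String topicFor(byte[] packet) {
        if (packet == null || packet.length == 0) {
            return "unknown";
        }
        switch (packet[0]) {
            case 'Q': return "quotes";   // assumed: 'Q' marks a quote packet
            case 'T': return "trades";   // assumed: 'T' marks a trade packet
            default:  return "unknown";  // anything else goes to a catch-all topic
        }
    }

    public static void main(String[] args) {
        byte[] packet = "Q|ACME|101.5".getBytes(StandardCharsets.US_ASCII);
        // The returned name would then be handed to whichever publisher is in use
        // (a Kafka producer or a Pub/Sub publisher).
        System.out.println(topicFor(packet));
    }
}
```

Keeping the routing decision in a pure function like this makes it independent of the messaging system, so the same code would work whether the publisher ends up being Kafka or Pub/Sub.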

Kafka clusters have to be managed, configured, and monitored for optimal performance and scalability. This may require additional staff and tooling to run the operation.

Kafka, Spark and HDFS can optionally be deployed over Amazon EC2 (or Google Cloud using connectors).

I was reading about Google Cloud DataFlow, Cloud Storage, BigQuery and Pub/Sub. DataFlow provides autoscaling and tools to monitor data pipelines in real time, which is extremely useful. But the setup has a few restrictions: for example, Pub/Sub push requires the subscriber to expose an HTTPS endpoint, so the app has to be deployed behind a web server, e.g. an App Engine webapp or a web server on GCE.

This may not be as efficient as deploying a bidirectional TCP/IP app that can leverage the Pub/Sub and DataFlow pipelines for streaming data (I'm concerned about latency when using HTTP).

Ideally, the setup on Google Cloud would be to run the TCP client that connects to the external gateway on GCE and have it push data through Pub/Sub to the desktop consumer app. In addition, it would leverage the DataFlow pipeline for analytics, and Cloud Storage with Spark for ML (the Prediction API is a bit restrictive), using the Cloudera Spark connector for DataFlow.

One could deploy Kafka/Spark/HDFS etc. on Google Cloud, but that rather defeats the purpose of leveraging the Google Cloud technology.

I'd appreciate any thoughts on whether the above setup is possible using Google Cloud, or whether I should stay with EC2/Kafka/Spark etc.

Kamal Aboul-Hosn · Accepted Answer · 2016-02-16T22:47:52

Speaking about the Cloud Pub/Sub side, there are a couple of things to keep in mind:

If you don't want to run a web server in your subscribers, you could consider using the pull-based subscriber instead of the push-based one. To minimize latency, you want to have at least a few outstanding pull requests at any time.
Having your desktop consumer app act as a subscriber to Pub/Sub directly will only work if you have no more than 10,000 clients; there is a limit of 10,000 subscriptions. If you need to scale beyond that, you should consider Google Cloud Messaging or Firebase.
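The "few outstanding pull requests" advice in the first point can be illustrated with a small, self-contained sketch. It does not use the Cloud Pub/Sub client library: `PullSource` is a hypothetical stand-in for whatever blocking pull call the real client exposes, and `drain`, `outstanding`, and `expected` are names invented for this example. The point is only the shape of the technique: several workers each keep one pull in flight, so a message never has to wait for a previous pull round trip to finish.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class ConcurrentPuller {

    /** Hypothetical stand-in for a blocking pull call; returns null when nothing arrived. */
    interface PullSource {
        String pull() throws InterruptedException;
    }

    /**
     * Keeps `outstanding` pull calls in flight until `expected` messages have
     * been received or a 5-second safety timeout expires.
     */
    static List<String> drain(PullSource source, int outstanding, int expected) {
        List<String> received = new CopyOnWriteArrayList<>();
        CountDownLatch remaining = new CountDownLatch(expected);
        ExecutorService pool = Executors.newFixedThreadPool(outstanding);

        for (int i = 0; i < outstanding; i++) {
            pool.submit(() -> {
                try {
                    // Each worker re-issues a pull as soon as the previous one returns.
                    while (remaining.getCount() > 0) {
                        String message = source.pull();
                        if (message != null) {
                            received.add(message);
                            remaining.countDown();
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // stop quietly on shutdown
                }
            });
        }

        try {
            remaining.await(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdownNow(); // interrupts workers still blocked in pull()
        }
        return new ArrayList<>(received);
    }
}
```

A caller could try this against an in-memory queue, for example `ConcurrentPuller.drain(() -> queue.poll(50, TimeUnit.MILLISECONDS), 3, 5)` with a `BlockingQueue<String>` holding five messages. In a real subscriber the lambda would wrap the actual pull request, and each message would be acknowledged after processing.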
