1 vote

I am having a hard time understanding the differences between GCP Dataflow/Apache Beam and Spring Cloud Dataflow. What I am trying to do is move to a more cloud-native solution for streaming data processing, so our developers can focus more on developing core logic rather than managing infrastructure.

We have an existing streaming solution that consists of Spring Cloud Data Flow 'modules' that we can iterate on and deploy independently, much like microservices. This works great, but we are looking to migrate to an existing platform in GCP, provided by our business, that requires us to use GCP Dataflow. At a high level, the solution is simply:

Stream 1:

Kafka Source (S0) -> Module A1 (Ingest) -> Module B1 (Map) -> Module C1 (Enrich) -> Module D1 (Split) -> Module E1 (Send output to Sink S1)

Stream 2:

Kafka Source (S1) -> Module A2 (Ingest) -> Module B2 (Persist to DB) -> Module B3 (Send Notifications through various channels)

The solution we would like to move to should, from what I understand, be identical; however, the modules would become GCP Dataflow modules, and the source/sink would become GCP Pub/Sub rather than Kafka.

Most of the documentation I have come across does not compare SCDF and Apache Beam (GCP Dataflow modules) as similar solutions, so I am wondering how/if it would be possible to port our existing logic over to an architecture like that.

Any clarification would be greatly appreciated. Thanks in advance.


2 Answers

8 votes

I want to add +1 to @guillaume-blaquiere's response re: "rewriting code". Let me also add a bit more color on this subject.

SCDF Overview:

Spring Cloud Data Flow (SCDF), at its core, is nothing but a RESTful service, a lightweight Spring Boot application even. That's all. It doesn't require a special runtime; the Boot app can run wherever there's Java, including on your laptop, on any container platform (Kubernetes, Cloud Foundry, Nomad, etc.), or on any cloud (AWS, GCP, Azure, etc.).

SCDF (Boot App / Über-jar) ships with a Dashboard, Shell/CLI, and APIs, so developers and operators can use them to design and deploy streaming or batch data pipelines.

The data pipelines in SCDF are made of Spring Cloud Stream or Spring Cloud Task applications. Because these are standalone and autonomous microservice applications, users can patch or rolling-upgrade individual applications in isolation, without impacting the upstream or downstream applications in the data pipeline — more details about the architecture here.
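For example, one such 'module' can be as small as a Boot application that exposes a single function. Here is a minimal sketch, assuming the Spring Cloud Stream functional programming model; the EnrichProcessorApplication and enrich names are purely illustrative:

```java
import java.util.function.Function;

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.context.annotation.Bean;

@SpringBootApplication
public class EnrichProcessorApplication {

    public static void main(String[] args) {
        SpringApplication.run(EnrichProcessorApplication.class, args);
    }

    // Spring Cloud Stream binds this Function to the configured middleware
    // (Kafka, RabbitMQ, ...); SCDF wires the input/output destinations when
    // the app is deployed as part of a stream. Names here are placeholders.
    @Bean
    public Function<String, String> enrich() {
        return payload -> payload + " [enriched]";
    }
}
```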

SCDF vs. GCDF:

Spring Cloud Stream and Spring Cloud Task are roughly comparable to Apache Beam. These are SDK/libraries.

SCDF, on the other hand, has some similarities to data pipelines in Google Cloud Dataflow (GCDF), but the modules in GCDF are expected to be built with Apache Beam. In other words, you cannot take the Spring Boot streaming/batch microservices running in SCDF and run them in GCDF as modules - you will have to rewrite them using the Apache Beam APIs. Both SCDF and GCDF rely directly on their respective aforementioned frameworks and capabilities.
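To make the "rewrite" concrete, here is a rough sketch of what one stage of your stream could look like after porting to Apache Beam, assuming Pub/Sub topics replace the Kafka topics; the project, subscription, and topic names are placeholders, not anything from your setup:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class EnrichPipeline {

    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
        Pipeline pipeline = Pipeline.create(options);

        pipeline
            // Read messages from a Pub/Sub subscription (placeholder name).
            .apply("ReadFromPubSub", PubsubIO.readStrings()
                .fromSubscription("projects/my-project/subscriptions/ingest-sub"))
            // Map/enrich each element; this replaces the Spring Cloud Stream function.
            .apply("Enrich", MapElements.into(TypeDescriptors.strings())
                .via(payload -> payload + " [enriched]"))
            // Write the results to an output topic (placeholder name).
            .apply("WriteToPubSub", PubsubIO.writeStrings()
                .to("projects/my-project/topics/enriched-topic"));

        pipeline.run();
    }
}
```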

The other important distinction to highlight is the runtime component required to run Apache Beam modules. In GCDF, the runtime/runner concerns are hidden from users, because they are managed by GCP. In SCDF, by contrast, to run at scale you pick the platform of your choice, and SCDF runs the applications as native containers on that platform.
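As a small illustration of that difference, pointing a Beam pipeline at the managed Dataflow runner is mostly a matter of pipeline options; this is only a sketch, and the project, region, and bucket values are placeholders:

```java
import org.apache.beam.runners.dataflow.DataflowRunner;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DataflowOptionsExample {

    public static void main(String[] args) {
        DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args)
            .withValidation()
            .as(DataflowPipelineOptions.class);

        // GCP provisions and manages the workers; you only declare where the job runs.
        options.setRunner(DataflowRunner.class);
        options.setProject("my-gcp-project");          // placeholder project id
        options.setRegion("europe-west1");             // placeholder region
        options.setTempLocation("gs://my-bucket/tmp"); // placeholder staging bucket

        // Pipeline.create(options) then builds the pipeline exactly as before.
    }
}
```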

SCDF on GKE/GCP:

You could provision a GKE cluster on GCP and use SCDF's Helm chart to run SCDF and your current streaming/batch (Spring Boot) apps natively in Kubernetes, effectively as a managed service on GCP.

3 votes

First, to clarify: Spring Cloud Data Flow is totally different from GCP Dataflow.

Spring Cloud Data Flow is comparable to Apache Beam. Both are frameworks for describing data transformations, like an ETL.

GCP Dataflow is an auto-scaling, managed platform hosted on GCP. It accepts a processing flow described with the Apache Beam framework. GCP Dataflow is in charge of running the pipeline, spawning as many VMs as the pipeline requires, dispatching the work to those VMs, and so on.

Apache Beam is an open source project with many connectors. Many of them target GCP (because Beam was originally a Google product that was open sourced), but there are other connectors as well, like a Kafka IO connector.
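For instance, reading from Kafka in Beam looks roughly like this (a sketch only; broker and topic names are placeholders):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.kafka.common.serialization.StringDeserializer;

public class KafkaReadExample {

    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Consume a Kafka topic through Beam's Kafka IO connector.
        pipeline.apply("ReadFromKafka", KafkaIO.<String, String>read()
            .withBootstrapServers("broker-1:9092")   // placeholder broker address
            .withTopic("ingest-topic")               // placeholder topic name
            .withKeyDeserializer(StringDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
            .withoutMetadata());                     // yields a PCollection of KV<String, String>

        pipeline.run();
    }
}
```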

Beam also integrates different runners: DirectRunner for launching the pipeline on your current machine, DataflowRunner for running it on GCP Dataflow, SparkRunner for running it on an Apache Spark cluster, and so on.

It's a great solution, but it has no direct relationship, compatibility, or portability with Spring Cloud Data Flow. You have to rewrite your code to move from one to the other.

Hope this helps your understanding.