Is the natural replacement for Spark (Direct) Streaming either Spark Structured Streaming or Kafka Streams?

Question

Over the past few years we have developed quite a few Spark Streaming (Direct API) applications that read from or write to Kafka, IBM MQ, Hive, HBase, HDFS, and other systems on our Cloudera platform. Now that the Direct API of Spark Streaming (we are currently on version 2.3.2) is deprecated, and since we recently added the Confluent Platform (which ships with Kafka 2.2.0) to our project, we plan to migrate these applications.

What is the natural replacement for our Spark Streaming applications? Should we migrate to Spark Structured Streaming or rather to Kafka Streams?

I personally have no experience with either framework, but in my view Spark Structured Streaming seems to be the natural choice. Our code base is mainly written in Scala, which can also be used with the Structured Streaming API. Kafka Streams has a few limitations with Scala. Although we might lose some flexibility by leaving the low-level RDD API and moving to the higher-level DataFrame API, we could build on our existing Spark knowledge.

On the other hand, there is Kafka Streams, which is probably the best choice when it comes to processing data between Kafka topics, which is our main use case. And looking at all the Kafka connectors that come with Confluent, the other use cases can be served as well.

OneCricketeer · Accepted Answer · 2020-02-20T05:10:15

You already run some Spark scheduler, so you can use Structured Streaming, which is binary compatible with the old Streaming API.
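For a rough idea of what a Kafka-to-Kafka job looks like in Structured Streaming, here is a minimal sketch in Scala. The broker address, topic names, application name, and checkpoint path are placeholders I made up, and it assumes the `spark-sql-kafka-0-10` connector matching your Spark version is on the classpath.

```scala
import org.apache.spark.sql.SparkSession

object KafkaToKafka {
  def main(args: Array[String]): Unit = {
    // Master and deploy mode are expected to come from spark-submit.
    val spark = SparkSession.builder
      .appName("kafka-to-kafka") // hypothetical application name
      .getOrCreate()

    // Streaming source: each row carries binary `key` and `value` columns.
    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // placeholder broker
      .option("subscribe", "input-topic")                  // placeholder topic
      .load()

    // Example transformation: upper-case the message value.
    val output = input.selectExpr(
      "CAST(key AS STRING) AS key",
      "upper(CAST(value AS STRING)) AS value"
    )

    // The Kafka sink writes the `key` and `value` columns to the target topic
    // and requires a checkpoint location.
    val query = output.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("topic", "output-topic")                                    // placeholder topic
      .option("checkpointLocation", "/tmp/checkpoints/kafka-to-kafka")    // placeholder path
      .start()

    query.awaitTermination()
  }
}
```

Sources and sinks without a built-in streaming connector (IBM MQ, HBase) would need something extra, so treat this only as the shape of the simplest case.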

If you're using Mesos or k8s, then putting Kafka Streams apps in Docker and running those is, IMO, easier to scale, monitor, and configure than Spark, since each app acts like any other Docker container in those systems, so you can build one pattern around everything.

"Kafka Streams... is probably the best choice when it comes to processing data between Kafka topics"

True.

"Kafka Streams has a few limitations with Scala."

I think you might want to keep reading that section:

"The Kafka Streams DSL for Scala library is a wrapper over the existing Java APIs for Kafka Streams DSL that addresses the concerns raised"
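To give a feel for that wrapper, here is a minimal topic-to-topic sketch using the Scala DSL. It assumes the `kafka-streams-scala` artifact for Kafka 2.2 is on the classpath; the application id, broker address, and topic names are made up. Newer Kafka releases moved the implicit serdes to `org.apache.kafka.streams.scala.serialization.Serdes`, so the import would need adjusting there.

```scala
import java.util.Properties

import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}
import org.apache.kafka.streams.scala.ImplicitConversions._
import org.apache.kafka.streams.scala.Serdes._
import org.apache.kafka.streams.scala.StreamsBuilder
import org.apache.kafka.streams.scala.kstream.KStream

object UppercaseApp extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-app")      // hypothetical app id
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")  // placeholder broker

  val builder = new StreamsBuilder()

  // Serdes for String keys and values are resolved implicitly via the imports above.
  val input: KStream[String, String] =
    builder.stream[String, String]("input-topic") // placeholder topic

  // Example transformation: upper-case the message value, then write it out.
  input
    .mapValues((value: String) => value.toUpperCase)
    .to("output-topic") // placeholder topic

  val streams = new KafkaStreams(builder.build(), props)
  streams.start()

  // Close the streams instance cleanly when the JVM (or container) shuts down.
  sys.addShutdownHook {
    streams.close()
  }
}
```

Since this is a plain JVM application, it can be packaged into a Docker image and scaled by running more instances with the same application id.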

Of course, you could always use Kotlin to interoperate better with the Java API.
