1 vote

We have an existing batch processing system that works as follows:

  • Hive SQL is used for daily batch processing.
  • Data is ingested either from files or from an RDBMS.
  • Data flows Raw --> Staging --> Mart: raw to staging is just cleansing and formatting of the data, while staging to mart contains all the business transformations.

Now, to get real-time (or near-real-time) data, I am evaluating the Lambda Architecture. The plan is:

  • All source systems will land their data on Kafka.
  • The same batch processing system will consume the Kafka topics.
  • A new Spark application will consume the Kafka topics for streaming.
  • The serving layer will create views that combine the aggregated data from both streaming and batch for real-time (near-real-time) queries.

The problem is that the logic will be duplicated in HiveQL (batch) and Spark (streaming). Is there a way I can avoid or minimize this duplication?

2
If the data comes from an RDBMS, it is presumably in at least 3NF. Do you need to do a lot of joins to get usable business objects? How often are the joined records updated? Could you provide a bit more info about the nature of the data? Is it an append-only data source, or can records be deleted and updated in the source system? - alexeipab

2 Answers

1 vote

You can build your processing stages using Spark SQL and Spark Structured Streaming: https://spark.apache.org/docs/2.2.0/structured-streaming-programming-guide.html. Depending on your needs there can be some incompatibilities, but I'd try to build the Spark aggregations and transformations using the Dataset[_] API and then run them both ways, batch and streaming.
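A language-agnostic sketch of that idea (plain Python standing in for the Spark Dataset API; the function name `aggregate_orders` and the sample data are purely illustrative): write the business transformation once as a single function, then feed it the full dataset on the batch path and incremental micro-batches on the streaming path.

```python
from collections import defaultdict

def aggregate_orders(totals, records):
    """Shared business logic: fold (user, amount) order records into
    per-user totals. Defined once, invoked from both paths."""
    for user, amount in records:
        totals[user] += amount
    return totals

orders = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)]

# Batch path: one pass over the whole dataset.
batch_result = aggregate_orders(defaultdict(float), orders)

# Streaming path: the same function applied per micro-batch, carrying
# state forward (conceptually what Structured Streaming does for you).
stream_state = defaultdict(float)
for micro_batch in [orders[:2], orders[2:]]:
    aggregate_orders(stream_state, micro_batch)

assert batch_result == stream_state  # identical logic, identical results
```

In Spark terms, the analogue is a function from `Dataset` to `Dataset` that you apply to both `spark.read` (batch) and `spark.readStream` (streaming) sources, so the transformation code lives in exactly one place.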

0 votes

The problem of a duplicated code base is inherent in the lambda architecture. It gets a mention in the 'criticism' section of the Wikipedia page.

Another issue is that the data in the batch and stream layers is not in sync, which can lead to unexpected results when bringing the data together: for example, joining across stream and batch when keys do not yet exist in the batch view.

I believe the lambda architecture comes from a belief that streaming is complex and expensive, so keep batch as much as possible and add streaming only for those elements that require near-real-time: "we already have batch, let's add a few streaming things."

An alternative architecture is to use streaming for everything. This is based on the realization that batch is a special case of streaming, so you can do your batch and stream processing on a single streaming platform.
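One way to picture "batch is a special case of streaming" (a plain-Python sketch, not real Kafka API calls): both modes run the same processing pipeline over the same log; batch simply replays the retained log from offset 0, while streaming tails it from the current offset.

```python
log = ["e1", "e2", "e3", "e4"]  # stand-in for a retained Kafka topic

def process(events):
    # the single processing pipeline shared by both modes
    return [e.upper() for e in events]

# "Batch": replay the whole retained log from the beginning.
batch_output = process(log[0:])

# "Streaming": consume only what arrives after the current offset.
current_offset = 3
stream_output = process(log[current_offset:])
```

With one pipeline and one platform, a full reprocessing run is just a replay from offset 0, so there is no second code base to keep in sync.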

Some relevant reading:

  • use Spark Structured Streaming for batch
  • lambda architecture issues and how using only streaming solves them
  • questioning the lambda architecture