1 vote

We have an existing batch processing system that works as follows:

  • Hive SQL is used for daily batch processing.
  • Data is ingested either from files or from an RDBMS.
  • Data flows Raw --> Staging --> Mart: raw to staging is just cleansing and formatting of the data, while staging to mart contains all the business transformations.

Now, to get real-time (or near-real-time) data, I am evaluating the Lambda Architecture. The plan is:

  • All source systems will land their data on Kafka.
  • The same batch processing system will consume the Kafka topics.
  • A new Spark application will consume the Kafka topics for streaming.
  • The serving layer will create views that combine the aggregated data from both streaming and batch for real-time (near-real-time) queries.

The problem is that the logic will be duplicated in HiveQL (batch) and Spark (streaming). Is there a way I can avoid or minimize this duplication?

2
If the data comes from an RDBMS, it is presumably in at least 3NF. Do you need to do a lot of joins to get usable business objects? How often are the joined records updated? Could you provide a bit more info about the nature of the data? Is it an append-only data source, or can records be deleted and updated in the source system? - alexeipab

2 Answers

1 vote

You can build your processing stages using Spark SQL and Spark Structured Streaming: https://spark.apache.org/docs/2.2.0/structured-streaming-programming-guide.html. Depending on your needs there can be some incompatibilities, but I'd try to build the Spark aggregations and transformations using the Dataset[_] API and then run them both ways, batch and streaming.
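A language-agnostic sketch of that idea (plain Python standing in for the Spark Dataset API; the function name `aggregate_orders` and the sample data are purely illustrative): write the business transformation once as a single function, then feed it the full dataset on the batch path and incremental micro-batches on the streaming path.

```python
from collections import defaultdict

def aggregate_orders(totals, records):
    """Shared business logic: fold (user, amount) order records into
    per-user totals. Defined once, invoked from both paths."""
    for user, amount in records:
        totals[user] += amount
    return totals

orders = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)]

# Batch path: one pass over the whole dataset.
batch_result = aggregate_orders(defaultdict(float), orders)

# Streaming path: the same function applied per micro-batch, carrying
# state forward (conceptually what Structured Streaming does for you).
stream_state = defaultdict(float)
for micro_batch in [orders[:2], orders[2:]]:
    aggregate_orders(stream_state, micro_batch)

assert batch_result == stream_state  # identical logic, identical results
```

In Spark terms, the analogue is a function from `Dataset` to `Dataset` that you apply to both `spark.read` (batch) and `spark.readStream` (streaming) sources, so the transformation code lives in exactly one place.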

0 votes

The problem of a duplicated code base is inherent in the lambda architecture. It gets a mention in the 'criticism' section of the Wikipedia page.

Another issue is that the data in the batch and stream layers is not in sync, which can lead to unexpected results when bringing the data together: for example, joining across stream and batch when keys do not yet exist in the batch view.

I believe the lambda architecture comes from a belief that streaming is complex and expensive, so keep batch as much as possible and add streaming only for those elements that require near-real-time: "we already have batch, let's add a few streaming things."

An alternative architecture is to use streaming for everything. This is based on the realization that batch is a special case of streaming, so you can do your batch and stream processing on a single streaming platform.
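One way to picture "batch is a special case of streaming" (a plain-Python sketch, not real Kafka API calls): both modes run the same processing pipeline over the same log; batch simply replays the retained log from offset 0, while streaming tails it from the current offset.

```python
log = ["e1", "e2", "e3", "e4"]  # stand-in for a retained Kafka topic

def process(events):
    # the single processing pipeline shared by both modes
    return [e.upper() for e in events]

# "Batch": replay the whole retained log from the beginning.
batch_output = process(log[0:])

# "Streaming": consume only what arrives after the current offset.
current_offset = 3
stream_output = process(log[current_offset:])
```

With one pipeline and one platform, a full reprocessing run is just a replay from offset 0, so there is no second code base to keep in sync.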

Some relevant reading:

  • use Spark Structured Streaming for batch
  • lambda architecture issues and how using only streaming solves them
  • questioning the lambda architecture