We have an existing batch processing system that works as follows:
- Hive SQL is used for daily batch processing.
- Data is ingested from either files or an RDBMS.
- Data flows through Raw --> Staging --> Mart: raw to staging is just cleansing and formatting of the data, while staging to mart applies all the business transformations.
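To make the two stages concrete, here is a minimal plain-Python sketch of the split (the real pipeline is HiveQL; table and field names like `customer_id` and `amount` are hypothetical):

```python
# Raw -> Staging: cleansing and formatting only, no business logic.
def cleanse(raw_row):
    return {
        "customer_id": raw_row["customer_id"].strip(),  # trim stray whitespace
        "amount": float(raw_row["amount"]),             # normalize type
        "order_date": raw_row["order_date"][:10],       # keep YYYY-MM-DD
    }

# Staging -> Mart: business transformation (here, per-customer totals).
def to_mart(staging_rows):
    totals = {}
    for row in staging_rows:
        totals[row["customer_id"]] = totals.get(row["customer_id"], 0.0) + row["amount"]
    return totals
```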
Now, to get real-time (or near real-time) data, I am evaluating the Lambda Architecture. The plan is:
- All source systems will land their data on Kafka.
- The same batch processing system will consume the Kafka topics.
- A new Spark application will consume the Kafka topics for streaming.
- The serving layer will create views that combine the aggregated data from both the streaming and batch layers for real-time (near real-time) queries.
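To illustrate the last bullet, a minimal sketch (plain Python, hypothetical aggregate shapes) of how a serving view could merge the batch and speed-layer aggregates:

```python
def merge_views(batch_agg, speed_agg):
    """Combine the batch aggregate (complete up to the last batch run)
    with the speed-layer aggregate (events arrived since that run)."""
    merged = dict(batch_agg)
    for key, value in speed_agg.items():
        merged[key] = merged.get(key, 0) + value
    return merged
```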
The problem is that the business logic will be duplicated in HiveQL (batch) and Spark (streaming). Is there a way I can avoid or minimize this duplication?
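To show the kind of single-sourcing I am after, one sketch (hypothetical table and column names): keep each business transformation as one HiveQL-compatible query template, run it in Hive for batch and, since Spark SQL can execute most HiveQL, run the same text via `spark.sql()` against a temp view registered over the streaming DataFrame:

```python
# Shared business logic kept as a single HiveQL-compatible query template
# (table/column names are hypothetical).
MART_QUERY = """
SELECT customer_id, SUM(amount) AS total_amount
FROM {source}
GROUP BY customer_id
""".strip()

def mart_query(source):
    """Render the shared query against a batch table or a streaming temp view."""
    return MART_QUERY.format(source=source)

# Batch:     run mart_query("staging.orders") in Hive.
# Streaming: stream_df.createOrReplaceTempView("orders_stream")
#            spark.sql(mart_query("orders_stream"))
```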