I have an architecture dilemma for my current project, which involves near real-time processing of a large amount of data. Here is a diagram of the current architecture:
Here is an explanation of the idea that led me to that picture:
When the API Gateway receives a request, it puts the record into a Kinesis stream (this is because of the "fire and forget" nature of my application), which is how I arrived at this design. The input data is partitioned into shards based on a specific request attribute, which guarantees me the correct ordering.
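For context, this is roughly how the records end up in the stream. A minimal sketch using boto3; the stream name and the `session_id` partition-key attribute are just placeholders for whatever attribute I actually shard on:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def put_event(event: dict) -> None:
    # The partition key (here a hypothetical "session_id" attribute) decides
    # which shard the record lands on, so records sharing that key stay ordered.
    kinesis.put_record(
        StreamName="incoming-events",           # assumed stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["session_id"],       # hypothetical request attribute
    )
```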
Then I have a Lambda that handles input validation and anomaly detection. It acts as an abstraction that keeps the data clean for the next layer, the data enrichment. This Lambda sends the data to a Kinesis Firehose delivery stream, because Firehose can back up the "raw" data to S3 (something I definitely want to have) and can also attach a transformation Lambda that does the enrichment, so I don't have to take care of saving the data to S3 myself; it comes out of the box. Everything is great until the moment I need the received data to keep its ordering (the enricher is doing sessionization). That ordering is lost in Firehose, because there is no data separation there the way there is with Kinesis stream shards.
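To make the enrichment step concrete, the Firehose transformation Lambda has to return records in the shape sketched below; `enrich()` is a placeholder for my sessionization/enrichment logic, not real code from my project:

```python
import base64
import json

def handler(event, context):
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        enriched = enrich(payload)  # placeholder for the enrichment/sessionization step
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(
                json.dumps(enriched).encode("utf-8")
            ).decode("utf-8"),
        })
    # Firehose delivers the transformed records; with source record backup enabled,
    # the untransformed ("raw") records go to the backup S3 bucket.
    return {"records": output}

def enrich(payload: dict) -> dict:
    # Hypothetical stand-in: this is where sessionization would happen, but Firehose
    # gives no per-partition-key ordering guarantee inside these batches.
    return payload
```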
So the only thing I could think of is to move the sessionization into the first Lambda. That breaks my abstraction, because the Lambda would then also care about data enrichment, and the bigger drawback is that the backed-up data would already contain enriched data, which also breaks the architecture. All of this happens because Firehose has no concept of sharding.
Can anyone think of a solution to this problem that doesn't lose the out-of-the-box features AWS provides?