3
votes

Many AWS reference architectures for serverless real-time analytics suggest pushing processed data from Lambda to S3 through Kinesis Firehose.

e.g. https://aws.amazon.com/blogs/big-data/create-real-time-clickstream-sessions-and-run-analytics-with-amazon-kinesis-data-analytics-aws-glue-and-amazon-athena/

Why can’t we push data from Lambda to S3 directly? Isn't it better to avoid the complexity and additional cost by skipping the intermediate Kinesis Firehose component? Is there any problem with Lambda writing real-time data directly to S3?

1 Answer

5
votes

Mainly because Firehose batches the data for you. It buffers incoming records until a configured size or time threshold is reached (e.g. 128 MB of data), writes the batch to S3 as a single object, optionally gzipped, and then starts buffering again. If you let the Lambda write to S3 directly, you have to do the batching yourself, which is pretty difficult with stateless Lambdas.
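To make the batching idea concrete, here is a minimal sketch of a size-or-age buffer in plain Python. It is not how Firehose is implemented; `RecordBuffer`, its parameters and the `flush` callback are hypothetical names for illustration. Note that it has to keep state between records, which is exactly what a stateless Lambda cannot do across invocations.

```python
import gzip
import time


class RecordBuffer:
    """Illustrative buffer: collects records and emits one gzipped batch.

    Hypothetical sketch of the batching concept, not Firehose's implementation.
    """

    def __init__(self, flush, max_bytes=128 * 1024 * 1024,
                 max_age_seconds=300.0, clock=time.monotonic):
        self._flush = flush            # callback that receives the gzipped batch
        self._max_bytes = max_bytes    # size threshold, measured before compression
        self._max_age = max_age_seconds
        self._clock = clock
        self._records = []
        self._size = 0
        self._started = None           # time the first record of the batch arrived

    def add(self, record: bytes) -> None:
        if self._started is None:
            self._started = self._clock()
        self._records.append(record)
        self._size += len(record)
        # Simplification: the age check only runs when a record arrives.
        too_big = self._size >= self._max_bytes
        too_old = self._clock() - self._started >= self._max_age
        if too_big or too_old:
            self.flush()

    def flush(self) -> None:
        if not self._records:
            return
        # Newline-delimited records, compressed into a single payload.
        body = gzip.compress(b"\n".join(self._records) + b"\n")
        self._flush(body)
        self._records = []
        self._size = 0
        self._started = None
```

In a real setup the `flush` callback would upload the payload as one S3 object; here it can be any callable, e.g. `list.append`, to see when batches are emitted.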

That being said, this mainly applies if your data consists of MANY records / rows. If, on the other hand, you are basically dealing with blobs of, let's say, 50 MB that your Lambda outputs, then you can / should write to S3 directly, because batching may not be possible or useful in your case.
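As a rough sketch of that decision, a Lambda could route each payload by size. Everything named here is an assumption for illustration: the function `deliver`, the bucket and stream names, and the 1,000,000-byte threshold (pick one that fits your data). The clients are passed in; in a Lambda they would typically be `boto3.client("s3")` and `boto3.client("firehose")`.

```python
def deliver(payload: bytes, key: str, s3_client, firehose_client,
            bucket: str = "example-analytics-bucket",    # hypothetical bucket name
            stream: str = "example-delivery-stream",     # hypothetical stream name
            direct_threshold: int = 1_000_000) -> str:   # illustrative cut-off in bytes
    """Send large blobs straight to S3 and small records through Firehose."""
    if len(payload) >= direct_threshold:
        # A big blob is already a reasonably sized S3 object; batching adds little.
        s3_client.put_object(Bucket=bucket, Key=key, Body=payload)
        return "s3"
    # Small records are left to Firehose to buffer into larger objects.
    firehose_client.put_record(DeliveryStreamName=stream,
                               Record={"Data": payload})
    return "firehose"
```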

Whether or not you should use Firehose simply depends on what data / throughput you have and what requirements there may be.

One problem with writing real-time data to S3 directly is that if you want to, e.g., query it with Athena, you will get into a lot of trouble if you have millions of files that are a few bytes each instead of hundreds of files that are tens of MB each.
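A quick back-of-the-envelope calculation shows how fast the object count grows. The numbers are assumptions picked for illustration (10 GiB of data, 200-byte records written one per object versus 64 MiB batched objects), and `object_count` is just a helper name.

```python
def object_count(total_bytes: int, object_bytes: int) -> int:
    """Number of objects needed to hold total_bytes (integer ceiling division)."""
    return -(-total_bytes // object_bytes)


total = 10 * 1024**3                          # 10 GiB of data (assumed volume)
small = object_count(total, 200)              # one 200-byte record per object
large = object_count(total, 64 * 1024**2)     # 64 MiB batched objects

print(f"{small:,} small objects vs {large:,} batched objects")
```

With these assumed numbers that is roughly 54 million objects versus 160 for the same data, and a query engine has to list and open every one of them.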