4
votes

I'm pretty new to AWS, and I'm trying to find a way to reliably transfer data from a Kinesis stream to an AWS RDS Postgres database table. The records will need to undergo small transformations on the way in, like filtering (not all records will be inserted, depending on a key) and parsing for the insert into Postgres. Currently, the data from the Kinesis stream is being dumped by Firehose into S3 buckets as Parquet.

I'm a bit lost in the many possible ways there seem to be of doing this, for example:

  • Kinesis streams -> Firehose -> Lambda -> RDS
  • Kinesis streams -> Firehose -> S3 -> Data Pipeline ETL job -> RDS
  • Database migration for S3 -> RDS?
  • AWS Glue?
  • others...?

In a non-serverless world, I would run a cron job every hour, say, which would take the files in the most recent S3 bucket partition (which is year/month/day/hour), so the latest hour, filter out the records not needed in RDS, and bulk insert the rest into RDS (a rough sketch of what I mean is below). I don't want an EC2 instance that sits idle 95% of the time to do this. Any advice?
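For illustration, something like this, assuming Python with pandas/pyarrow and a hypothetical `event_type` column used for the filter (bucket, table, and column names are placeholders):

```python
from datetime import datetime, timedelta

import pandas as pd
from sqlalchemy import create_engine

# Previous full hour's partition, matching the year/month/day/hour layout.
ts = datetime.utcnow() - timedelta(hours=1)
prefix = f"s3://my-bucket/{ts:%Y/%m/%d/%H}/"  # bucket name is a placeholder

df = pd.read_parquet(prefix)                   # needs s3fs + pyarrow installed
df = df[df["event_type"].isin({"purchase"})]   # drop the records not needed in RDS

engine = create_engine("postgresql://user:pass@rds-host:5432/mydb")
df.to_sql("events", engine, if_exists="append", index=False)  # bulk insert the rest
```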

1
Hi, good point, but before answering I have some questions. What is the volume of data you need to filter and insert? Which method would you like to use to insert the data into RDS: a sequence of INSERT INTOs, or a batch load? What data freshness do you want? – Bartosz Konieczny
Not a lot at the moment, a few MB at a time. Either a sequence of INSERT INTOs or a batch load is OK. It doesn't need to be low latency; it can take a few minutes. I'd be interested in both low- and high-latency ways though, since I'm just trying to understand the different ways this appears possible. – bill_e

1 Answer

1
votes

Thanks for the clarification. Doing it the traditional ETL way with servers has some drawbacks: either you'll need to keep a machine idle most of the time, or you'll need to wait every time the machine is created on demand - exactly as you're saying.

As for Firehose, IMO it's interesting when you have a lot of real-time data to ingest. Regarding AWS Glue, to me it's more like a "managed" Apache Spark, so if you have some data processing logic to apply to a large amount of batch data, it can be interesting. But according to your description, that's not your case, right?

To sum up, if you think the amount of inserted data will always stay at a few MB at a time, for me the simplest solution is the best, i.e. Kinesis -> Lambda -> RDS, with maybe another Lambda to back up the data to S3 (the Kinesis retention period is limited to 7 days). It's especially interesting from the pricing point of view - since you don't have a lot of data and Lambda executes on demand, batching, for instance, 1000 Kinesis records per invocation, it's a good occasion to save some money. Otherwise, if you expect to have more and more data, the "Firehose -> Lambda" version seems a better fit to me, because you don't load the database with a big amount of data at once.
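A minimal sketch of that Lambda, assuming Python with psycopg2 and a hypothetical `events` table with an `event_type` key used for the filter (adapt the names to your schema):

```python
import base64
import json
import os

import psycopg2
from psycopg2.extras import execute_values

# Connection details from environment variables (variable names are assumptions).
# Opened outside the handler so the connection is reused across warm invocations.
conn = psycopg2.connect(
    host=os.environ["DB_HOST"],
    dbname=os.environ["DB_NAME"],
    user=os.environ["DB_USER"],
    password=os.environ["DB_PASSWORD"],
)

WANTED_TYPES = {"purchase", "refund"}  # hypothetical filter values

def handler(event, context):
    rows = []
    for record in event["Records"]:
        # Kinesis payloads arrive base64-encoded in the Lambda event.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Filter: skip records whose key doesn't match.
        if payload.get("event_type") not in WANTED_TYPES:
            continue
        rows.append((payload["id"], payload["event_type"], payload["value"]))

    if rows:
        with conn.cursor() as cur:
            # execute_values batches all the rows into a single INSERT statement.
            execute_values(
                cur,
                "INSERT INTO events (id, event_type, value) VALUES %s",
                rows,
            )
        conn.commit()
```

With the Kinesis event source mapping you can set the batch size (e.g. 1000 records per invocation, as mentioned above), so each invocation does one batched insert instead of many single-row ones.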