We recently released an open source project to stream data to Redshift in near realtime.
GitHub: https://github.com/practo/tipoca-stream
The realtime data pipeline streams data from RDS to Redshift.
- Debezium writes the RDS events to Kafka.
- We wrote Redshiftsink to sink data from Kafka to Redshift.
We have thousands of tables streaming to Redshift, and we load them with the COPY command. We want to load every ~10 minutes to keep the data as near realtime as possible.
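For readers less familiar with COPY, here is a rough sketch of the kind of statement a sink could issue per table for each batch. It is not taken from Redshiftsink: the function name, the bucket, the IAM role, and the choice of gzipped JSON files listed in an S3 manifest are all assumptions made for illustration. Loading one manifest per table lets a single COPY pick up many files at once.

```python
def build_copy_statement(schema: str, table: str,
                         manifest_s3_uri: str, iam_role_arn: str) -> str:
    """Build one COPY statement that loads every file listed in a manifest.

    Hypothetical helper: the JSON 'auto' + GZIP format is an assumption,
    not necessarily what Redshiftsink writes to S3.
    """
    def quote_ident(name: str) -> str:
        # Double-quote identifiers and escape embedded double quotes.
        return '"' + name.replace('"', '""') + '"'

    def quote_literal(value: str) -> str:
        # Single-quote string literals and escape embedded single quotes.
        return "'" + value.replace("'", "''") + "'"

    return (
        f"COPY {quote_ident(schema)}.{quote_ident(table)} "
        f"FROM {quote_literal(manifest_s3_uri)} "
        f"IAM_ROLE {quote_literal(iam_role_arn)} "
        "MANIFEST FORMAT AS JSON 'auto' GZIP;"
    )


# Example with placeholder values (bucket and role are made up).
statement = build_copy_statement(
    "public",
    "orders",
    "s3://example-bucket/orders/batch-0001.manifest",
    "arn:aws:iam::123456789012:role/ExampleCopyRole",
)
print(statement)
```

With thousands of tables, one such statement per table every ~10 minutes is what adds up to the load described here.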
Problem: Parallel loads become a bottleneck. Redshift is not good at ingesting data at such short intervals. We understand that Redshift is not a realtime database, but what is the best that can be done? Does Redshift plan to solve this in the future?
Workaround that works for us: We have 1000+ tables in Redshift, but no more than 400 of them are used on any given day. For this reason we now throttle loads for the unused tables when needed. This feature makes sure the tables that are in use always stay near realtime and keeps Redshift less burdened. It has been very useful.
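To make the idea concrete, here is a minimal sketch of the throttling decision. It is not the actual Redshiftsink code: the `TableState` type, the function names, the one-day "in use" window, the one-hour interval for idle tables, and the `cluster_busy` flag are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class TableState:
    name: str
    last_queried: Optional[datetime]  # None if the table was never queried
    last_loaded: Optional[datetime]   # None if the table was never loaded


def load_interval(table: TableState, now: datetime,
                  cluster_busy: bool = True,
                  active_window: timedelta = timedelta(days=1),
                  active_interval: timedelta = timedelta(minutes=10),
                  idle_interval: timedelta = timedelta(hours=1)) -> timedelta:
    """Return how often this table should be loaded right now."""
    recently_used = (
        table.last_queried is not None
        and now - table.last_queried <= active_window
    )
    # Throttle only unused tables, and only while the cluster needs relief.
    if recently_used or not cluster_busy:
        return active_interval
    return idle_interval


def due_for_load(table: TableState, now: datetime,
                 cluster_busy: bool = True) -> bool:
    """True if enough time has passed since the table's last load."""
    if table.last_loaded is None:
        return True
    interval = load_interval(table, now, cluster_busy=cluster_busy)
    return now - table.last_loaded >= interval


now = datetime(2021, 1, 1, 12, 0)
hot = TableState("orders", now - timedelta(hours=2), now - timedelta(minutes=11))
cold = TableState("audit_log", now - timedelta(days=5), now - timedelta(minutes=11))

print(due_for_load(hot, now))   # queried today, so it keeps the 10-minute cadence
print(due_for_load(cold, now))  # unused for days, so it waits for the longer interval
```

In this sketch the last-queried time would have to come from somewhere such as Redshift's query history; how that is collected is left out here.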
Looking for suggestions from the Redshift community!