
We recently released an open source project to stream data to Redshift in near realtime.

Github: https://github.com/practo/tipoca-stream

The realtime data pipeline streams data from RDS to Redshift.

  • Debezium writes the RDS events to Kafka (a rough sketch of registering such a connector is shown below).
  • We wrote Redshiftsink to sink data from Kafka to Redshift.
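
For context, the Debezium side is a standard Kafka Connect source connector. The sketch below registers a MySQL connector over the Connect REST API; the hostnames, credentials, table list, and connector name are placeholders, exact property names vary by Debezium version, and some required settings (e.g. schema history) are omitted, so treat it as an illustration rather than our exact config.

```python
# Hypothetical sketch: register a Debezium MySQL source connector with the
# Kafka Connect REST API. Endpoint, credentials and table list are placeholders.
import json
import requests

connector = {
    "name": "rds-orders-connector",                      # placeholder name
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "my-rds-instance.example.com",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "change-me",
        "database.server.id": "5400",
        "topic.prefix": "rds",                           # property name differs in older Debezium versions
        "table.include.list": "shop.orders,shop.payments",
    },
}

resp = requests.post(
    "http://kafka-connect:8083/connectors",              # placeholder Connect URL
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```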

We have 1000s of tables streaming to Redshift, and we load them with the COPY command. We want to load every ~10 minutes to keep the data as near realtime as possible.
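
To make the load pattern concrete, here is a minimal sketch of the kind of COPY we issue roughly every 10 minutes. This is not the actual Redshiftsink code; the connection details, table, S3 manifest path, and IAM role are placeholders.

```python
# Hypothetical sketch of a periodic COPY into one table from a gzipped
# S3 manifest. All identifiers below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.xxxx.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    dbname="analytics",
    user="loader",
    password="change-me",
)

copy_sql = """
COPY public.orders
FROM 's3://my-bucket/topics/orders/manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
FORMAT AS JSON 'auto'
GZIP
MANIFEST;
"""

with conn.cursor() as cur:
    cur.execute(copy_sql)
conn.commit()
conn.close()
```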

Problem: Parallel loads become a bottleneck. Redshift is not good at ingesting data at such short intervals. We do understand Redshift is not a realtime database. What is the best that can be done? Does Redshift plan to solve this in the future?

Workaround that works for us! We have 1000+ tables in Redshift, but no more than 400 of them are used on a given day. That is why we now throttle loads for the unused tables when needed. This feature makes sure the tables that are in use stay near realtime and keeps Redshift less burdened. This has been very useful.
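
The throttling itself is conceptually simple. Below is a hedged sketch of the idea, not the actual Redshiftsink feature; the set of "active" tables and the two intervals are placeholders.

```python
# Rough sketch of throttled loading: tables seen in recent queries load
# every ~10 minutes, idle tables are loaded far less often.
import time

ACTIVE_TABLES = {"orders", "payments"}   # assumed to come from query usage data
FAST_INTERVAL = 10 * 60                  # seconds between loads for active tables
SLOW_INTERVAL = 6 * 60 * 60              # seconds between loads for idle tables

last_loaded: dict[str, float] = {}

def due_for_load(table: str, now: float) -> bool:
    interval = FAST_INTERVAL if table in ACTIVE_TABLES else SLOW_INTERVAL
    return now - last_loaded.get(table, 0.0) >= interval

def run_cycle(all_tables, load_fn):
    now = time.time()
    for table in all_tables:
        if due_for_load(table, now):
            load_fn(table)               # e.g. issue a COPY like the one above
            last_loaded[table] = now
```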

Looking for suggestions from the Redshift community!

You are going to need to zoom in on the time the issues are seen. You are loading every 5 minutes, so I assume that is how often a spike comes along. Is each 5 min load taking 5 minutes to complete? Then you will always be busy. What is your node type? If you can get 1 min max and avg metrics, we can compare these to the instance spec. Same thing with the network bandwidth. – Bill Weiner
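
For anyone who wants to pull the 1-minute metrics mentioned above, a sketch using CloudWatch could look like the following; the region, cluster identifier, and choice of metrics are placeholders.

```python
# Hypothetical sketch: fetch 1-minute Redshift CPU and network metrics
# from CloudWatch for the last hour.
from datetime import datetime, timedelta, timezone
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")

def redshift_metric(metric_name, cluster_id="my-redshift-cluster"):
    now = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/Redshift",
        MetricName=metric_name,                      # e.g. "CPUUtilization"
        Dimensions=[{"Name": "ClusterIdentifier", "Value": cluster_id}],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=60,                                   # 1-minute granularity
        Statistics=["Average", "Maximum"],
    )
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])

cpu = redshift_metric("CPUUtilization")
net_in = redshift_metric("NetworkReceiveThroughput")
```
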
Redshift is an analytic powerhouse, but data needs to be organized and analyzed to achieve peak performance. The rate at which you are adding data is doable with Redshift but tricky. Your data loads need to be well organized and your UPSERT queries well written. Data distribution and sort order need to be well thought out. At some point you just cannot push data any faster into Redshift, as it is not a real-time database. To get what you have to work better, we need to find what limitation is being hit and improve along that axis. – Bill Weiner
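
The "well written UPSERT" referred to here is usually the standard Redshift staging-table merge: COPY into a temp table, delete the matching rows, insert, all in one transaction. A rough sketch follows; table, key column, S3 path, and IAM role are placeholders, and it reuses a psycopg2 connection like the one in the COPY sketch above.

```python
# Sketch of the common staging-table merge (upsert) pattern for Redshift.
# All identifiers are placeholders, not Redshiftsink's exact SQL.
upsert_sql = """
CREATE TEMP TABLE orders_staging (LIKE public.orders);

COPY orders_staging
FROM 's3://my-bucket/topics/orders/manifest'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'
FORMAT AS JSON 'auto' GZIP MANIFEST;

DELETE FROM public.orders
USING orders_staging
WHERE public.orders.id = orders_staging.id;

INSERT INTO public.orders SELECT * FROM orders_staging;

DROP TABLE orders_staging;
"""

def run_upsert(conn):
    # psycopg2 keeps this in a single transaction until commit(), so readers
    # never see the table with rows deleted but not yet re-inserted.
    with conn.cursor() as cur:
        cur.execute(upsert_sql)
    conn.commit()
```
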
Good to hear, and good work tracking down this network bandwidth issue. This is a common issue (S3 bandwidth on COPY) and I can give you some general advice. I believe that your cluster has 4 slices by default, so 4 independent connections to S3 can be initiated in parallel. This is a big part of the performance speed-up you are seeing. Each slice can load some of the S3 files and things run in parallel. Now it looks to me like your numbers are a bit off - 60 x 200kB = 12MB, not really that big. Can you confirm it shouldn't be 200MB files? – Bill Weiner
Now it is critical to compress the files in S3. Moving uncompressed data over a network that is your limiter is not the best approach. If your files are not compressed in S3, you can likely improve bandwidth by 2X or more. – Bill Weiner
Lastly, don't go overboard on making lots of small S3 objects. It takes S3 on the order of 0.5 sec to look up the object you want and start returning the data, so you want the transfer time of the data to be much larger than this object lookup time. You should be able to get at least 10bT bandwidths between these services, so a 1MB file will take about 1 sec to transfer - a 33% overhead, not good. 100MB files will be .33%, good. This is also why I think your units may be off. – Bill Weiner
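
The trade-off in that last comment is easy to sanity-check. The back-of-the-envelope calculation below assumes a ~0.5 s S3 first-byte latency and a nominal 1 MB/s effective per-connection throughput; both numbers are illustrative, not measured.

```python
# Rough check of the file-size trade-off: share of COPY time lost to the
# per-object lookup as a function of object size. Constants are assumptions.
LOOKUP_SECONDS = 0.5            # assumed S3 time-to-first-byte
BANDWIDTH_MB_PER_S = 1.0        # assumed effective per-connection throughput

def lookup_overhead(file_size_mb: float) -> float:
    transfer = file_size_mb / BANDWIDTH_MB_PER_S
    return LOOKUP_SECONDS / (LOOKUP_SECONDS + transfer)

for size in (1, 10, 100):
    print(f"{size} MB object: {lookup_overhead(size):.1%} of time spent on lookup")
```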

1 Answer


I may be wrong, but I am of the view this use case is absolutely inappropriate for Redshift, for such a wide range of reasons that I will here stick to only one or two.

  1. What are you doing about VACUUM on these thousands of tables?
  2. Redshift has an absolute maximum of 50 concurrent queries. A more typical configuration is for 10. Redshift is not designed to run so many queries.
  3. I guess each load you have is for a small number of rows. Be aware that all disk I/O is in one megabyte (base 2) blocks, each table is of course broken up into columns, and these columns are present on every slice. If you load say a single row into a table with say twenty columns, you will load to a single slice, and if that load is not in a transaction, you will perform (I think - it could be more) two megabytes of disk I/O per column, so 40 megabytes of disk I/O - for one row. Imagine now you have a few rows, more than the number of slices; you will be performing 40 megabytes of disk I/O per slice. Redshift is staggeringly inefficient at loading small volumes of data, because of the design choices which must be taken to support timely SQL on Big Data. (A rough version of this arithmetic is sketched below.)
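
Here is a back-of-the-envelope version of point 3, using the answer's own figures: 1 MB blocks, a twenty-column table, and roughly two blocks written per column for an unbatched single-row load. The blocks-per-column figure is the answer's estimate, not a measured value.

```python
# Write amplification estimate for a tiny, unbatched load on Redshift.
BLOCK_MB = 1            # Redshift block size in MB
COLUMNS = 20            # example table width from the answer
BLOCKS_PER_COLUMN = 2   # the answer's estimate for a single-row load

io_per_slice_mb = COLUMNS * BLOCKS_PER_COLUMN * BLOCK_MB
print(io_per_slice_mb)  # 40 MB of disk I/O to land one row on one slice
```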

I may be wrong, but I think you are treating Redshift as if it were a conventional unsorted row-store database, like Postgres. Redshift is utterly different: it offers, and only offers, one service, which is that, when and only when correctly used, you will have timely SQL on Big Data. There is nothing else. No tons of users, no tons of queries, no rapid or small data.

Right now I would expect the vast majority of your Redshift spend is going on inefficiency. You are likely paying for gigabytes of disk I/O to process only hundreds of kilobytes of actual data. You should use a database appropriate to your use case, where you only need to pay for hundreds of kilobytes of disk I/O to process hundreds of kilobytes of actual data.