I have a Redshift cluster that I use for an analytics application. I have incoming data that I would like to add to a `clicks`
table. Let's say I have ~10 new 'clicks' to store each second. If possible, I would like the data to be available in Redshift as soon as possible.
From what I understand, insert performance is poor because of the columnar storage, so you have to insert in batches. My workflow is to store the clicks in Redis, and every minute insert the ~600 buffered clicks from Redis into Redshift as a batch.
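Roughly, the buffering side looks like this (a simplified sketch of what I just described; the Redis key name, connection settings and click payload are illustrative, not my real code):

```python
import json

import redis

r = redis.Redis(host="localhost", port=6379, db=0)
BUFFER_KEY = "clicks:buffer"  # hypothetical Redis list used as the buffer


def record_click(click: dict) -> None:
    """Called from the application for each incoming click (~10/s)."""
    r.rpush(BUFFER_KEY, json.dumps(click))


def drain_buffer() -> list:
    """Run once a minute: atomically grab and clear everything buffered so far."""
    pipe = r.pipeline()  # transactional by default, so read + delete happen together
    pipe.lrange(BUFFER_KEY, 0, -1)
    pipe.delete(BUFFER_KEY)
    buffered, _ = pipe.execute()
    return [json.loads(item) for item in buffered]
```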
I have two ways of inserting a batch of clicks into Redshift:

**Multi-row insert strategy**: I use a regular `insert` query for inserting multiple rows. Multi-row insert documentation here.

**S3 Copy strategy**: I copy the rows to S3 as `clicks_1408736038.csv`, then run a `COPY` to load the file into the `clicks` table. COPY documentation here.
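To make the two options concrete, this is roughly how I call them from Python (a sketch only, assuming psycopg2, a `clicks(user_id, url, ts)` table and an IAM role for `COPY`; the endpoint, column and role names are placeholders):

```python
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect(
    host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    dbname="analytics",
    user="loader",
    password="...",
)


def multi_row_insert(clicks):
    """Strategy 1: a multi-row INSERT (one statement with many VALUES tuples)."""
    with conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO clicks (user_id, url, ts) VALUES %s",
            [(c["user_id"], c["url"], c["ts"]) for c in clicks],
            page_size=max(len(clicks), 1),  # keep the whole batch in one statement
        )
    conn.commit()


def copy_from_s3(s3_path):
    """Strategy 2: the CSV is assumed to be uploaded already; just COPY it in."""
    with conn.cursor() as cur:
        cur.execute(
            "COPY clicks FROM %s "
            "CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/my-redshift-role' "
            "CSV",
            (s3_path,),  # e.g. 's3://my-bucket/clicks_1408736038.csv'
        )
    conn.commit()
```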
I've done some tests (on a `clicks` table that already had 2 million rows):
| batch size   | Multi-row insert: `insert` query | S3 Copy: upload to S3 | S3 Copy: `COPY` query |
|--------------|----------------------------------|-----------------------|-----------------------|
| 1 record     | 0.25s                            | 0.20s                 | 0.50s                 |
| 1k records   | 0.30s                            | 0.20s                 | 0.50s                 |
| 10k records  | 1.90s                            | 1.29s                 | 0.70s                 |
| 100k records | 9.10s                            | 7.70s                 | 1.50s                 |
As you can see, in terms of performance it looks like I gain nothing by first copying the data to S3: the upload + `COPY` time is roughly equal to the `insert` time.
Questions:
What are the advantages and drawbacks of each approach? What is the best practice? Did I miss anything?
And a side question: is it possible for Redshift to `COPY` the data automatically from S3 via a manifest, i.e. to `COPY` the data as soon as new `.csv` files are added to S3? Docs here and here. Or do I have to write a background worker myself to trigger the `COPY` commands?
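If I do end up writing that worker myself, I imagine something along these lines (a rough sketch under my own assumptions: boto3 + psycopg2, placeholder bucket/role/table names, and naive tracking of which files were already loaded):

```python
import json
import time

import boto3
import psycopg2

s3 = boto3.client("s3")
BUCKET, PREFIX = "my-clicks-bucket", "incoming/"  # placeholders
loaded = set()  # keys already COPYed; a real worker would persist this somewhere


def copy_new_files(conn):
    """List new click CSVs, write a manifest for them, and COPY them in one go."""
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    new_keys = [o["Key"] for o in resp.get("Contents", []) if o["Key"] not in loaded]
    if not new_keys:
        return
    manifest = {
        "entries": [{"url": f"s3://{BUCKET}/{key}", "mandatory": True} for key in new_keys]
    }
    manifest_key = f"manifests/clicks_{int(time.time())}.manifest"
    s3.put_object(Bucket=BUCKET, Key=manifest_key, Body=json.dumps(manifest))
    with conn.cursor() as cur:
        cur.execute(
            f"COPY clicks FROM 's3://{BUCKET}/{manifest_key}' "
            "CREDENTIALS 'aws_iam_role=arn:aws:iam::123456789012:role/my-redshift-role' "
            "CSV MANIFEST"
        )
    conn.commit()
    loaded.update(new_keys)


if __name__ == "__main__":
    conn = psycopg2.connect(host="my-cluster.xxxxxxxx.us-east-1.redshift.amazonaws.com",
                            port=5439, dbname="analytics", user="loader", password="...")
    while True:              # crude stand-in for a cron job
        copy_new_files(conn)
        time.sleep(60)
```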
My quick analysis:
In the documentation about consistency, there is no mention of loading data via multi-row inserts. It looks like the preferred way is `COPY`ing from S3 with unique object keys (each `.csv` on S3 has its own unique name)...
**S3 Copy strategy**:

- PROS: looks like the good practice from the docs.
- CONS: more work (I have to manage buckets, manifests, and a cron that triggers the `COPY` commands...).

**Multi-row insert strategy**:

- PROS: less work. I can call an `insert` query from my application code.
- CONS: doesn't look like a standard way of importing data. Am I missing something?