1
votes

I have a redshift cluster with a single dc1.large node. I've got data writing into it, on order of 50 million records a day, in the format of a timestamp, a user ID and an item ID. The item ID (varchar) is unique, the user ID (varchar) is not, and the timestamp (timestamp) is not.

In my redshift DB of about 110m records, if I have a table with no sort key, it takes about 30 seconds to search for a single item ID.

If I have a table with a sort key on item ID, I get a single item ID search time of about 14-16 seconds.

If I have a table with an interleved sort key with all three columns, the single item ID search time is still 14-16 seconds.

What I'm hoping to achieve is the ability to query for the records of thousands or tens of thousands of item IDs on order of a second.

The query just looks like

select count(*) from rs_table where itemid = 'id123';

or

select count(*) from rs_table where itemid in ('id123','id124','id125');

This query comes back in 541ms

select count(*) from rs_table;

AWS documentation suggests that there is a compile time for queries the first time they're run, but I don't think that's what I'm seeing (and it would be not ideal if it was, since each unique set of 10,000 IDs might never be queried in exactly the same order again.

I have to assume I'm doing something wrong with either the sort key design, the query, or some combination of the two - for only ~10g of table space, something like redshift shouldn't take this long to query, right?

1

1 Answers

1
votes

Josh,

We probably need a few additional pieces of information to give you a good recommendation.

Here are some things to start thinking about.

  • Are most of your queries record lookups as you describe above?
  • What is your distribution key?
  • Do you join this table with other large fact tables?
  • If you load 50M records per day and you only have 110M records in the table, does that mean that you only store 2 days?
  • Do you do massive deletes and then load another 50M records per day?
  • Do you run ANALYZE after your loads?
  • If you deleted a large number of records, did you run VACUUM?

If all of your queries are similar to the ones that you describe, why are you using Redshift? Amazon DynamoDB or MongoDB (even Cassandra) would be great database choices for the types of queries that you describe.

If you run analytical workloads Redshift is an excellent platform. If you are more interested in "record lookups" the NoSQL options, as well as mysql or MariaDB might give you better performance.

Also, if this is a dev/test environment and you have loaded and deleted large amounts of data without ever running a VACUUM you would see significant performance degradation.