3 votes

I'm relatively new to setting up PostgreSQL databases and to AWS in general.

I'm trying to figure out the practical differences between Amazon RDS instance types. Some context about my org and use cases:

  • We're a small team (fewer than 5 users) of data analysts/data scientists.
  • We don't need 100% uptime. I'll likely be turning the database on during normal business hours and off afterwards. Our use case is primarily analytics/data science; there will be no live applications running on top of it.
  • We will need 1 database for now.
  • The database will contain about 300-500 GB of data. Most tables are small (less than 1 GB). However, there are a few tables that are quite large. The largest table will be about 80 GB and will be queried frequently. There are a few other tables that are around 10 GB.

I started with the Free Tier (db.t2.micro) for now, just to gain some experience, develop, and test out what I need. However, it's hard for me to understand how this compares to other instances. Our team is also very budget-conscious right now, so I'm trying to scope this out before advocating for an upgrade.

Example operations that I'm trying to do

  1. Loading a table from CSVs: We have a lot of CSVs sitting on disk that I'll be loading into the PostgreSQL database. I just tested one: creating a table and loading an 11 GB CSV file (35,401,551 rows and 40 columns) into it using \copy took about 27 minutes (a rough sketch of the pattern I'm using is below, after this list). Would upgrading the instance help with that? If so, I'm not sure what I would want to upgrade to.
  2. A select count(*) from that table took about 15 minutes on the first run; the second run took about 3 minutes, probably because the data was cached by then (3 minutes still seems slow to me). Again, I'm not sure if upgrading the instance would help with that. I should also note that I haven't added any primary keys or indexes, so there might be other ways to optimize performance without upgrading the instance. For example, partitioning might be helpful: https://www.postgresql.org/docs/10/ddl-partitioning.html. The table has a state column that is very often one of the first filters users apply, since users often only want to see one particular state or a small set of states (see the partitioning sketch after this list).
    1. A simple group by query on that table, select ColA, count(*) group by ColA, took about 2.5 minutes.
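For reference, here is roughly the load pattern I'm using; the table, column, and file names below are placeholders rather than our real ones:

    -- create the target table first (the real table has ~40 columns)
    CREATE TABLE big_table (
        state  text,
        col_a  text,
        col_b  numeric
        -- ... remaining columns ...
    );

    -- psql meta-command: streams the local CSV from my machine to the server
    \copy big_table FROM 'data/big_table.csv' WITH (FORMAT csv, HEADER true)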
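To make the partitioning idea in point 2 concrete, here is a rough sketch of what a list-partitioned layout by state could look like in PostgreSQL 10, plus the simpler option of just indexing the state column; again, the names are placeholders and I haven't tried this yet:

    -- declarative list partitioning (PostgreSQL 10+): queries that filter on
    -- state only need to scan the matching partition(s)
    CREATE TABLE big_table_partitioned (
        state  text NOT NULL,
        col_a  text,
        col_b  numeric
    ) PARTITION BY LIST (state);

    -- one partition per state (or per group of states)
    CREATE TABLE big_table_ca PARTITION OF big_table_partitioned FOR VALUES IN ('CA');
    CREATE TABLE big_table_ny PARTITION OF big_table_partitioned FOR VALUES IN ('NY');

    -- simpler first step: an index on state on the existing, unpartitioned table
    CREATE INDEX ON big_table (state);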

EDIT: Just noting some other useful resources I've found:

  1. https://www.apptio.com/blog/ec2-m5-vs-t3/ on choosing between M5 and T3 instances

1 Answer

1 vote

The key differences between instance types within RDS are as follows:

  • Available Memory
  • Available CPU
  • Maximum Network Throughput
  • Burstable vs non-burstable (the T instance family uses CPU credits; once these are depleted, your CPU is capped).

As you're using a t2.micro you only have a single vCPU available, and once the credits are depleted you will be capped at around 10% of that CPU, which will limit what you're able to do.

I would also note that if you're able to invest more in the AWS platform, there is a dedicated data warehouse solution for analytics, Redshift. It is likely to perform better for this kind of workload but will cost more; it's also forked from Postgres.

If sticking to Postgres on RDS is what you're looking to do, you should also consider Aurora PostgreSQL; it generally provides 2-3 times the performance of regular Postgres on RDS.