I'm relatively new to setting up PostgreSQL databases in general and AWS.
I'm trying to figure out what are the practical differences between Amazon RDS Instance Types. Some context about my org and use cases:
- We're primarily a small team (less than 5 users) of data analysts/data scientists
- We don't need 100% uptime. I'll likely be turning the database on during normal business hours and off afterwards. Our use case is primarily for analytics/data science purposes there will be no live applications running on top of it.
- We will need 1 database for now.
- The database will contain about 300 - 500 GB of data. Most tables are small (less than 1 GB). However, there are a few tables that are quite large. The largest table will be about 80GB and will be queried frequently. There are a few other tables that are around 10 GB.
I started with the Free Tier for now db.t2.micro
just to gain some experience, develop, and test out what I need. However, it's hard for me to understand how this compares to other instances. Our team is also very budget conscious right now so I'm trying to scope this out before advocating to upgrade.
Example operations that I'm trying to do
- Loading a table from CSVs: We have a lot of CSVs sitting on disk that I'll be loading into the PostgreSQL database. I just tested one right now, creating a table, and loading the CSV into the table (using
\copy
) of a 11 GB file (35,401,551 rows and 40 columns) took about 27 minutes. Would upgrading the instance help with that? If so, I'm not sure what I would want to upgrade to? - A
select count(*)
from that table took about 15 minutes (in the first run; in the second run it took about 3 minutes, probably because it's caching statistics? 3 minutes still seems slow to me.). Again, not sure if upgrading the instance would help with that? I should also note that I have no added primary keys or indices. So there might be other ways to optimize performance without upgrading the instance? For example, it might be possible that partitioning would be helpful: https://www.postgresql.org/docs/10/ddl-partitioning.html. The table has astate
column that is very often one of the first filters users use as often users only want to see one or a set of particular states.- A simple group by query:
select ColA, count(*) group by ColA
took about 2.5 minutes
- A simple group by query:
EDIT Just noting some other useful resources I've found:
- https://www.apptio.com/blog/ec2-m5-vs-t3/ on choosing between m5 vs t3